Temporal Exploratory Data Analysis¶

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import os
import sys
import warnings
from calendar import day_name
from glob import glob
from itertools import product
from typing import Dict

import altair as alt
import duckdb
import pandas as pd
from scipy.stats import linregress
from watermark import watermark
In [3]:
_ = alt.data_transformers.disable_max_rows()
_ = alt.renderers.set_embed_options(actions=False)
In [4]:
PROJ_ROOT = os.pardir
src_dir = os.path.join(PROJ_ROOT, "src")
sys.path.append(src_dir)
In [5]:
%aimport file_utils
import file_utils as flut

%aimport open_data
import open_data as od

%aimport pandas_utils
import pandas_utils as pu

%aimport visualization_helpers
import visualization_helpers as vzu

About¶

Objective¶

The processed bike share ridership data is explored in order to understand

  1. high-level historical network performance
  2. attributes of the stations identified to be top-performing stations using historical data and to be prioritized in the advertising campaign that is part of the business use-case for this project
  3. user behaviour by studying temporal patterns in historical bike share ridership data

Data¶

The following previously-created datasets are used in this exploratory data analysis

  1. daily weather data (raw)
  2. list of downtown neighbourhoods (raw)
  3. station info (includes geodata, raw)
  4. network expansion plans (raw)
  5. public holidays in Canada (raw)
  6. processed bike share ridership (raw)
  7. identification of bike share stations as top-performers or not (processed)

Notes¶

  1. This is the first part of a two-part EDA for this project. The next step contains the second part of the EDA, in which geospatial patterns in the Bike Share Toronto network will be studied.
  2. Using temporal insights, recommendations will be made in order to maximize exposure while also minimizing costs, as required by the client and discussed in the project scope. See the project's proposal from the project wiki for details about the scope.

Assumptions¶

  1. Same as in data retrieval and processing steps.

Outputs¶

  1. Charts showing EDA are produced.
  2. Recommendations file with logic to filter currently active stations based on temporal insights.

User Inputs¶

In [6]:
# ridership
years_proc_trips = {
    2018: [f'Q{k}' for k in range(1, 4+1)],
    2019: [f'Q{k}' for k in range(1, 4+1)],
    2020: [f'{str(k).zfill(2)}' for k in range(1, 12+1)],
    2021: [f'{str(k).zfill(2)}' for k in range(1, 12+1)],
    2022: [f'{str(k).zfill(2)}' for k in range(1, 12+1)],
    2023: [f'{str(k).zfill(2)}' for k in range(1, 3+1)],
}

# exploring turning (inflection) point
years_temp_dependence = [2018, 2019, 2020, 2021, 2022]

# plotting
axis_label_fontsize = 14

# exporting to disk
my_timezone = 'America/Toronto'
In [7]:
data_dir = os.path.join(PROJ_ROOT, 'data')
raw_data_dir = os.path.join(data_dir, 'raw', 'systems', 'toronto')
processed_data_dir = os.path.join(data_dir, 'processed')
reports_dir = os.path.join(PROJ_ROOT, 'reports')
figures_dir = os.path.join(reports_dir, 'figures')

# processed trips
fpaths_proc = {
    y: [
        f
        for p in periods
        for f in sorted(
            glob(
                os.path.join(
                    processed_data_dir,
                    f'processed__trips_{y}_{p}*.parquet.gzip',
                )
            )
        )
    ]
    for y, periods in years_proc_trips.items()
}
fpaths_proc_all = [f for _, v in fpaths_proc.items() for f in v]
fpaths_proc_2018_2022 = [f for y in range(2018, 2022+1) for f in fpaths_proc[y]]

# downtown neighbourhoods
fpath_downtown_neighs = glob(
    os.path.join(raw_data_dir, 'downtown_neighbourhoods__*.parquet.gzip')
)[0]

# expansion plans
fpath_expansion = glob(
    os.path.join(raw_data_dir, 'network_expansion__*.parquet.gzip')
)[0]

# station info for currently active stations
fpath_stations_info = glob(
    os.path.join(raw_data_dir, 'stations_info__*.parquet.gzip')
)[0]

# daily weather data
fpath_weather = glob(
    os.path.join(raw_data_dir, 'daily_weather__*.parquet.gzip')
)[0]

# holidays
fpath_holidays = glob(
    os.path.join(raw_data_dir, 'holidays__*.parquet.gzip')
)[0]

# top performing stations
fpath_top_stations = glob(
    os.path.join(processed_data_dir, 'stations_performance__*.parquet.gzip')
)[0]
In [8]:
def run_sql_query(query: str, verbose: bool=False) -> pd.DataFrame:
    """Run SQL query using DuckDB."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        df_query = duckdb.sql(query).df()
    if verbose:
        print(f"Query returned {len(df_query):,} rows")
    return df_query

Get Data¶

Daily Weather Data¶

Show the previously retrieved daily weather data

In [9]:
%%time
query = f"""
        SELECT *
        FROM read_parquet({[fpath_weather]})
        """
df_weather = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
    pu.show_df(df_weather)
column time tavg tmin tmax prcp snow wdir wspd wpgt pres tsun station_id
dtype datetime64[us] Float64 Float64 Float64 Float64 Int64 Int64 Float64 Int64 Float64 Int64 string[python]
nunique 1915 416 404 440 150 29 355 286 59 347 0 1
missing 0 0 0 0 7 1448 75 9 721 15 1915 0
0 2018-01-01 -15.0 -21.3 -8.7 0.0 110 <NA> 19.5 <NA> 1029.1 <NA> 71624
1 2018-01-02 -10.5 -13.1 -7.8 1.0 110 241 25.6 <NA> 1025.0 <NA> 71624
2 2018-01-03 -9.9 -13.5 -6.3 0.0 100 235 23.2 <NA> 1017.4 <NA> 71624
3 2018-01-04 -14.7 -20.5 -8.9 0.0 90 314 27.5 <NA> 1013.8 <NA> 71624
4 2018-01-05 -19.0 -23.0 -15.0 0.0 80 311 26.7 <NA> 1019.1 <NA> 71624
... ... ... ... ... ... ... ... ... ... ... ... ...
1910 2023-03-27 2.9 2.0 5.0 4.9 <NA> 55 9.4 <NA> 1018.0 <NA> 71624
1911 2023-03-28 3.5 1.0 6.0 0.1 <NA> 234 10.9 <NA> 1021.9 <NA> 71624
1912 2023-03-29 1.5 -3.9 8.0 1.6 <NA> 250 24.6 <NA> 1018.4 <NA> 71624
1913 2023-03-30 -1.3 -6.5 4.0 0.0 <NA> 251 14.0 <NA> 1026.8 <NA> 71624
1914 2023-03-31 2.4 0.0 4.4 8.3 <NA> 108 9.9 <NA> 1015.0 <NA> 71624

1915 rows × 12 columns

CPU times: user 24.1 ms, sys: 187 µs, total: 24.3 ms
Wall time: 23.7 ms

Downtown Neighbourhoods¶

Show previously retrieved neighbourhoods within downtown Toronto

In [10]:
df_downtown_neighs = pd.read_parquet(fpath_downtown_neighs)
pu.show_df(df_downtown_neighs)
column Neighbourhood Location is_downtown
dtype string[python] string[python] boolean
nunique 24 3 2
missing 0 0 0
0 University Downtown True
1 Kensington-Chinatown Downtown True
2 Wellington Place Downtown True
3 Harbourfront-CityPlace Downtown True
4 Bay-Cloverhill Downtown True
5 Yonge-Bay Corridor Downtown True
6 St Lawrence-East Bayfront-The Islands Downtown True
7 Church-Wellesley Downtown True
8 Downtown Yonge East Downtown True
9 North St.James Town Downtown True
10 Cabbagetown-South St.James Town Downtown True
11 Moss Park Downtown True
12 Regent Park Downtown True
13 Roncesvalles West of Downtown False
14 South Parkdale West of Downtown False
15 Dufferin Grove West of Downtown False
16 Little Portugal West of Downtown False
17 Palmerston-Little Italy West of Downtown False
18 Trinity-Bellwoods West of Downtown False
19 West Queen West West of Downtown False
20 Fort York-Liberty Village West of Downtown False
21 North Riverdale East of Downtown False
22 South Riverdale East of Downtown False
23 Blake-Jones East of Downtown False

Bike Share Station Info (MetaData)¶

Show the stations info data that was retrieved previously, containing station name and its associated neighbourhood name

In [11]:
%%time
query = f"""
        SELECT station_id,
               name,
               physical_configuration,
               capacity,
               is_charging_station,
               rental_methods LIKE '%CREDITCARD%' AS credit,
               Neighbourhood,
               COALESCE(Location, NULL, 'Others') AS Location,
               COALESCE(is_downtown, NULL, False) AS is_downtown,
               census_tract_id
        FROM read_parquet({[fpath_stations_info]})
        LEFT JOIN df_downtown_neighs USING (Neighbourhood)
        -- WHERE physical_configuration <> 'VAULT'
        ORDER BY station_id, name
        """
df_info = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
    pu.show_df(df_info)
column station_id name physical_configuration capacity is_charging_station credit Neighbourhood Location is_downtown census_tract_id
dtype string[python] string[python] string[python] Int64 boolean boolean string[python] string[python] boolean string[python]
nunique 790 790 6 40 2 2 107 4 2 272
missing 0 0 0 0 0 0 0 0 0 0
0 7000 Fort York Blvd / Capreol Ct REGULAR 35 False True Harbourfront-CityPlace Downtown True 5350012.01
1 7001 Wellesley Station Green P ELECTRICBIKESTATION 23 True True Church-Wellesley Downtown True 5350063.06
2 7002 St. George St / Bloor St W REGULAR 19 False True University Downtown True 5350061.00
3 7003 Madison Ave / Bloor St W REGULAR 15 False True Annex Others False 5350091.01
4 7005 King St W / York St REGULAR 23 False True Yonge-Bay Corridor Downtown True 5350014.00
... ... ... ... ... ... ... ... ... ... ...
785 7926 McRae Dr / Laird Dr - SMART SMARTMAPFRAME 24 False False Leaside-Bennington Others False 5350195.02
786 7927 Strachan Ave / East Liberty St - SMART SMARTMAPFRAME 24 False False Fort York-Liberty Village West of Downtown False 5350008.01
787 7928 Simcoe St / Pullan Pl REGULAR 31 False True Kensington-Chinatown Downtown True 5350036.00
788 7929 Spadina Ave / Bulwer St- SMART SMARTMAPFRAME 12 False False Kensington-Chinatown Downtown True 5350039.00
789 7930 Dunn Ave / Queen St W REGULAR 15 False True South Parkdale West of Downtown False 5350005.00

790 rows × 10 columns

CPU times: user 20.2 ms, sys: 201 µs, total: 20.4 ms
Wall time: 19.3 ms

Bike Share Network Expansion Plans¶

Show previously retrieved bike share network expansion plans

In [12]:
%%time
query = f"""
        SELECT *,
               NULL AS frac_ctracts_with_bikeshare
        FROM read_parquet({[fpath_expansion]})
        """
df_network_size = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
    pu.show_df(df_network_size)
column year trips num_stations num_bikes frac_neighs_with_bikeshare frac_ctracts_with_bikeshare
dtype Int64 Int64 Int64 Int64 Int64 Int64
nunique 3 3 3 3 0 0
missing 0 0 0 0 3 3
0 2023 5800000 820 8110 <NA> <NA>
1 2024 7000000 930 9055 <NA> <NA>
2 2025 8200000 1045 10000 <NA> <NA>
CPU times: user 9.16 ms, sys: 897 µs, total: 10.1 ms
Wall time: 9.59 ms

Public Holidays¶

Show the previously retrieved list of public holidays in Canada

In [13]:
%%time
query = f"""
        SELECT *,
               (
                   CASE WHEN DAYNAME(date) IN ('Sunday', 'Saturday')
                   THEN 'Monday'
                   ELSE DAYNAME(date)
                   END
               ) AS weekday
        FROM read_parquet({[fpath_holidays]})
        """
df_holidays = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
    pu.show_df(df_holidays)
column date holiday_name is_holiday weekday
dtype datetime64[us] string[python] boolean string[python]
nunique 48 13 1 5
missing 0 0 0 0
0 2018-01-01 New Year's Day True Monday
1 2018-03-30 Good Friday True Friday
2 2018-07-01 Canada Day True Monday
3 2018-09-03 Labor Day True Monday
4 2018-12-25 Christmas Day True Tuesday
5 2018-02-19 Family Day True Monday
6 2018-05-21 Victoria Day True Monday
7 2018-10-08 Thanksgiving Day True Monday
8 2018-12-26 Boxing Day True Wednesday
9 2019-01-01 New Year's Day True Tuesday
10 2019-04-19 Good Friday True Friday
11 2019-07-01 Canada Day True Monday
12 2019-09-02 Labor Day True Monday
13 2019-12-25 Christmas Day True Wednesday
14 2019-02-18 Family Day True Monday
15 2019-05-20 Victoria Day True Monday
16 2019-10-14 Thanksgiving Day True Monday
17 2019-12-26 Boxing Day True Thursday
18 2020-01-01 New Year's Day True Wednesday
19 2020-04-10 Good Friday True Friday
20 2020-07-01 Canada Day True Wednesday
21 2020-09-07 Labor Day True Monday
22 2020-12-25 Christmas Day True Friday
23 2020-02-17 Family Day True Monday
24 2020-05-18 Victoria Day True Monday
25 2020-10-12 Thanksgiving Day True Monday
26 2020-12-26 Boxing Day True Monday
27 2021-01-01 New Year's Day True Friday
28 2021-04-02 Good Friday True Friday
29 2021-07-01 Canada Day True Thursday
30 2021-09-06 Labor Day True Monday
31 2021-12-25 Christmas Day True Monday
32 2021-12-27 Christmas Day (observed) True Monday
33 2021-02-15 Family Day True Monday
34 2021-05-24 Victoria Day True Monday
35 2021-10-11 Thanksgiving Day True Monday
36 2021-12-26 Boxing Day True Monday
37 2021-12-28 Boxing Day (observed) True Tuesday
38 2022-01-01 New Year's Day True Monday
39 2022-01-03 New Year's Day (observed) True Monday
40 2022-04-15 Good Friday True Friday
41 2022-07-01 Canada Day True Friday
42 2022-09-05 Labor Day True Monday
43 2022-12-25 Christmas Day True Monday
44 2022-12-26 Boxing Day; Christmas Day (observed) True Monday
45 2022-02-21 Family Day True Monday
46 2022-05-23 Victoria Day True Monday
47 2022-10-10 Thanksgiving Day True Monday
CPU times: user 9.8 ms, sys: 1.38 ms, total: 11.2 ms
Wall time: 10.5 ms

Top-Performing Stations¶

Show previously classified stations as top-performers or not

In [14]:
df_stations = pd.read_parquet(fpath_top_stations)
pu.show_df(df_stations)
column station_id name physical_configuration capacity is_charging_station credit Neighbourhood Location census_tract_id is_active ... is_top_perform_station_weekday departures_weekend_last_year arrivals_weekend_last_year departures_weekend_last_n_years arrivals_weekend_last_n_years rank_weekend_deps_last_year rank_weekend_deps_last_n_years rank_weekend_arrs_last_year rank_weekend_arrs_last_n_years is_top_perform_station_weekend
dtype Int64 string[python] string[python] Int64 boolean boolean string[python] string[python] string[python] boolean ... boolean Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 boolean
nunique 627 627 6 39 2 2 83 4 214 1 ... 2 570 576 601 598 570 601 576 598 2
missing 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
0 7076 York St / Queens Quay W REGULAR 57 False True Harbourfront-CityPlace Downtown 5350012.04 True ... True 14168 16020 49454 54606 1 2 1 1 True
1 7016 Bay St / Queens Quay W (Ferry Terminal) REGULAR 35 False True St Lawrence-East Bayfront-The Islands Downtown 5350013.02 True ... True 10548 10835 40142 42497 5 5 5 5 True
2 7033 Union Station REGULAR 43 False True St Lawrence-East Bayfront-The Islands Downtown 5350013.01 True ... True 6026 7354 17353 20216 35 46 20 33 True
3 7175 HTO Park (Queens Quay W) REGULAR 27 False True Harbourfront-CityPlace Downtown 5350012.04 True ... True 11206 13941 41840 47634 3 4 2 3 True
4 7203 Bathurst St/Queens Quay(Billy Bishop Airport) REGULAR 35 False True Fort York-Liberty Village West of Downtown 5350008.02 True ... True 8970 10015 38856 41360 8 6 8 6 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
622 7156 Salem Ave / Bloor St W REGULAR 15 False True Dovercourt Village Others 5350096.02 True ... False 1655 1574 6196 5856 276 232 282 246 False
623 7600 Ursula Franklin St / Huron St - SMART SMARTMAPFRAME 20 False False University Downtown 5350061.00 True ... False 953 740 1605 1254 390 467 423 486 False
624 7414 Keele St / Annette St REGULAR 15 False True Junction Area Others 5350101.00 True ... False 694 687 2255 2283 442 412 431 401 False
625 7622 Marie Curtis Park REGULAR 23 False True Long Branch Others 5350206.01 True ... False 741 756 2248 2347 434 413 420 398 False
626 7694 Victoria Park Ave / Dawes Rd REGULAR 19 False True O'Connor-Parkview Others 5350190.02 True ... False 133 130 133 130 583 594 584 595 False

627 rows × 38 columns

Processed Bike Share Ridership Data¶

Show the first three rows of the file of processed bike share ridership for August of 2022

In [15]:
%%time
query = f"""
        SELECT *
        FROM read_parquet({[fpaths_proc[2022][7]]})
        WHERE started_at_year = 2022
        AND started_at_month = 8
        LIMIT 3
        """
df_proc_trips_preview = run_sql_query(query).convert_dtypes()
pu.show_df(df_proc_trips_preview)
column trip_id start_station_id started_at start_station_name end_station_id ended_at end_station_name bike_id user_type started_at_year started_at_month started_at_day started_at_hour started_at_minute ended_at_year ended_at_month ended_at_day ended_at_hour ended_at_minute
dtype Int64 Int32 datetime64[us] string[python] Int64 datetime64[us] string[python] Int64 string[python] Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32
nunique 3 2 1 2 2 2 1 3 1 1 1 1 1 1 1 1 1 1 2
missing 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
0 17515458 7259 2022-08-01 Lower Spadina Ave / Lake Shore Blvd 7712 2022-08-01 00:20:00 <NA> 3328 Casual Member 2022 8 1 0 0 2022 8 1 0 20
1 17515440 7208 2022-08-01 80 Clinton St North of College 7534 2022-08-01 00:07:00 Walnut Ave / Queen St W 4662 Casual Member 2022 8 1 0 0 2022 8 1 0 7
2 17515442 7259 2022-08-01 Lower Spadina Ave / Lake Shore Blvd 7712 2022-08-01 00:20:00 <NA> 4510 Casual Member 2022 8 1 0 0 2022 8 1 0 20
CPU times: user 78.8 ms, sys: 6.14 ms, total: 84.9 ms
Wall time: 51 ms

Notes

  1. The above contents come from the file containing processed bike share ridership for August of 2022. The file with processed data for all other months (2020, 2021, 2022 and 2023) and quarters (2018, 2019) contain the same column names.

Show Overall Network Performance¶

Get a high-level summary of historical station performance and future plans by calculating the following metrics

  1. total ridership (number of trips) across entire
  2. total number of bikes used in ridership across entire network
  3. total number of stations used in ridership across entire network
  4. fraction of all neighbourhoods with at least one bike share station

using the following approach

  1. get stations and bikes used in yearly departures, and number of departures (used as trips), between 2018 and 2022
  2. get stations and bikes used in yearly arrivals between 2018 and 2022
  3. Get yearly totals for departures (1)
  4. Get yearly totals for arrivals (2)
  5. Combine yearly totals for departures (4) and arrivals (5) using UNION
  6. Get yearly totals (from 5) for
    • number of trips taken (departures)
    • bikes used in historical bike share trips
    • stations used in historical bike share trips
  7. Combine overall historical totals (6) with network's future expansion plans
  8. Add date columns showing the year covering the start and end of future expansion plans
In [16]:
%%time
query = f"""
        -- 1. get pre-2023 yearly departure-related totals (up to Dec 31, 2022) per station
        WITH t1 AS (
            SELECT start_station_id AS station_id,
                   started_at_year AS year,
                   COUNT(DISTINCT(trip_id)) AS trips,
                   COUNT(DISTINCT(bike_id)) AS bikes_deps
            FROM read_parquet({fpaths_proc_all})
            WHERE started_at_year <= 2022
            -- (OPTIONAL) exclude trips missing a start station name
            -- WHERE start_station_name IS NOT NULL
            GROUP BY all
        ),
        -- 2. get pre-2023 yearly arrival-related totals (up to Dec 31, 2022) per station
        t2 AS (
            SELECT end_station_id AS station_id,
                   ended_at_year AS year,
                   COUNT(DISTINCT(bike_id)) AS bikes_arrs
            FROM read_parquet({fpaths_proc_all})
            WHERE started_at_year <= 2022
            -- (OPTIONAL) exclude trips missing a end station name
            -- WHERE end_station_name IS NOT NULL
            GROUP BY all
        ),
        -- 3. Aggregate pre-2023 yearly departures to get the following
        -- count the stations with departures
        -- count the number of bikes used for arrivals
        -- sum the number of departures (trips)
        t3 AS (
            SELECT year,
                   COUNT(distinct(station_id)) AS num_stations,
                   MAX(bikes_deps) AS bikes,
                   SUM(trips) AS trips,
                   'departures' AS type
            FROM t1
            GROUP BY year
        ),
        -- 4. Aggregate pre-2023 yearly arrivals to get the following
        -- count the stations with arrivals
        -- count the number of bikes used for arrivals
        t4 AS (
            SELECT year,
                   COUNT(distinct(station_id)) AS num_stations,
                   MAX(bikes_arrs) AS bikes,
                   NULL AS trips,
                   'arrivals' AS type
            FROM t2
            GROUP BY year
        ),
        -- 5. combine pre-2023 counts for departures and arrivals
        t5 AS (
            SELECT *
            FROM t3
            UNION ALL
            SELECT *
            FROM t4
        ),
        -- 6. get overal trips, stations, and bikes with bike share trips
        t6 AS (
            SELECT year,
                   SUM(trips) AS trips,
                   MAX(num_stations) AS num_stations,
                   MAX(bikes) AS num_bikes
            FROM t5
            GROUP BY ALL
            ORDER BY year
        ),
        -- 7. combine the yearly totals with the network expansion plans starting in 2023
        t7 AS (
            SELECT *
            FROM t6
            WHERE year < 2023
            UNION
            SELECT year,
                   trips,
                   num_stations,
                   num_bikes
            FROM df_network_size
            WHERE year >= 2023
            ORDER BY year
        ),
        -- 8. add indicator of future expansion plans
        t8 AS (
            SELECT *,
                   2023 AS shading_start,
                   2025 AS shading_stop
            FROM t7
        )
        SELECT *
        FROM t8
        """
df_yearly_summary = run_sql_query(query).convert_dtypes()
pu.show_df(df_yearly_summary)
column year trips num_stations num_bikes shading_start shading_stop
dtype Int64 Int64 Int64 Int64 Int32 Int32
nunique 8 8 8 8 1 1
missing 0 0 0 0 0 0
0 2018 1830573 359 0 2023 2025
1 2019 2326479 469 4787 2023 2025
2 2020 2739833 611 6417 2023 2025
3 2021 3435503 628 6120 2023 2025
4 2022 4514466 684 6523 2023 2025
5 2023 5800000 820 8110 2023 2025
6 2024 7000000 930 9055 2023 2025
7 2025 8200000 1045 10000 2023 2025
CPU times: user 20 s, sys: 1.32 s, total: 21.3 s
Wall time: 2.03 s

Notes

  1. In order to count the number of pre-2023 stations, the aggregations are counting the station ID columns (start_station_id or end_station_id) and are not using the station name columns (start_station_name or end_station_name). Dropping trips missing a start or end station name will lead to a difference between the analyzed and published yearly totals for the number of stations used. At this stage in the analysis, this SQL query is only creating a high-level overview of pre-2023 system performance using the raw data and planned expansion (starting from 2023). Additionally, in order to count the number of bikes, the station name column is not necessary since only the bike_id column can be used. So here too, dropping trips missing the station's name will lead to a difference compared to the published totals for the number of bikes used. For these two reasons, it was decided to keep trips that were missing a station name in these high-level totals.

Show charts of the overall

  1. trips
  2. number of stations and bikes used in historical ridership and planned to be available during the network expansion period
In [17]:
%%time
chart, charts_dict = vzu.plot_line_chart_grid(
    df=df_yearly_summary,
    xvar='year:O',
    yvars=["trips:Q", 'num_stations:Q', 'num_bikes:Q'],
    xvar_areas="shading_start:O",
    xvar_areas2="shading_stop:O",
    color_by_col_areas = 'index:N',
    areas_opacity = 0.1,
    axis_label_color = '#636363',
    axis_tick_label_color = 'grey',
    annotation_text_color = 'grey',
    ptitles={
        'trips': "Rapily growing ridership with nearly 5M trips in 2022",
        'num_stations': 'Strong yearly expansion of network footprint to resume in 2023',
        'num_bikes': 'Yearly expansion of bicycle inventory to restart in 2023',
    },
    ptitles_x_locs={'trips': 45, 'num_stations': 65, 'num_bikes': 70},
    ytitles={
        'trips': 'Trips', 'num_stations': 'Stations', 'num_bikes': 'Bicycles'
    },
    line_color = 'darkgreen',
    marker_size = 60,
    marker_fill = "white",
    marker_edge_color = 'darkgreen',
    annotation_text_y_loc = 140,
    annotation_text_opacity = 0.85,
    concat_spacing=10,
    axis_label_opacity=0.7,
    axis_label_fontsize=14,
    fig_size=dict(width=675, height=300),
)
chart.save(os.path.join(figures_dir, '01_high_level_summary.png'))
for k,v in charts_dict.items():
    chart = vzu.configure_chart(charts_dict[k], axis_label_fontsize)
    chart.save(os.path.join(figures_dir, f'01_high_level_summary__{k}.png'))
CPU times: user 925 ms, sys: 13.4 ms, total: 938 ms
Wall time: 836 ms
In [18]:
%%time
chart = vzu.plot_line_charts_with_shaded_area(
    pd.concat(
        [
            df_yearly_summary.query("year <= 2022"),
            df_yearly_summary.query("year >= 2023").drop(columns=['trips'])
        ]
    ),
    xvar="year:O",
    yvar="trips:Q",
    yvar2="num_stations:Q",
    xvar_areas="shading_start:O",
    xvar_areas2="shading_stop:O",
    color_by_col_areas = 'index:N',
    y_max = 9_000_000,
    xtitle = 'Trips (solid)',
    ytitle = 'Stations (dashed)',
    axis_label_fontsize = axis_label_fontsize,
    ptitle = alt.TitleParams(
        "Nearly 5M trips in 2022 and Expanding the Network's Footprint",
        anchor='start',
        dx=45,
        fontSize=axis_label_fontsize
    ),
    annotation_text = 'Four-Year Growth Plan',
    annotation_text_color = 'grey',
    annotation_text_opacity = 0.85,
    annotation_text_x_loc = 160,
    annotation_text_y_loc = 125,
    annotation_text_solid=dict(dx=175, dy=75),
    annotation_text_dashed=dict(dx=330, dy=-40),
    areas_opacity = 0.1,
    axis_label_color = 'darkgreen',
    axis_tick_label_color = 'darkgreen',
    axis2_label_color = 'green',
    axis2_tick_label_color = 'green',
    line_color = 'darkgreen',
    line_color2 = '#a1d99b',
    marker_size = 60,
    marker_fill = "white",
    marker_edge_color = 'darkgreen',
    marker_edge_color2 = '#a1d99b',
    fig_size = dict(width=675, height=300),
)
chart
CPU times: user 45 ms, sys: 132 µs, total: 45.2 ms
Wall time: 44.7 ms
Out[18]:

Observations

  1. Network expansion, in terms of bikes and stations added, was minimal during 2021 and also minimal in 2022. So, the dashed line slope is relatively small between 2021 and 2020 and between 2022 and 2021. However, the future expansion plans are similar to pre-2020 growth.
  2. Regardless of expansion, bikeshare ridership continues to show strong growth on a yearly basis.
  3. Overall, the network's usage and footprint are rapidly growing.
In [19]:
%%time
chart = vzu.plot_line_charts_with_shaded_area(
    df_yearly_summary.query('year != 2018'),
    xvar="year:O",
    yvar="num_stations:Q",
    yvar2="num_bikes:Q",
    xvar_areas="shading_start:O",
    xvar_areas2="shading_stop:O",
    color_by_col_areas = 'index:N',
    y_max = 1_200,
    xtitle = 'Stations (solid)',
    ytitle = 'Bicycles (dashed)',
    axis_label_fontsize = axis_label_fontsize,
    ptitle = alt.TitleParams(
        "Located in Half the City's Neighbourhoods and Expanding Bicycle Inventory",
        anchor='start',
        dx=65,
        fontSize=axis_label_fontsize
    ),
    annotation_text = 'Four-Year Growth Plan',
    annotation_text_color = 'grey',
    annotation_text_opacity = 0.85,
    annotation_text_x_loc = 170,
    annotation_text_y_loc = 125,
    annotation_text_solid=dict(dx=200, dy=5),
    annotation_text_dashed=dict(dx=280, dy=-50),
    areas_opacity = 0.1,
    axis_label_color = 'darkgreen',
    axis_tick_label_color = 'darkgreen',
    axis2_label_color = 'green',
    axis2_tick_label_color = 'green',
    line_color = 'darkgreen',
    line_color2 = '#a1d99b',
    marker_size = 60,
    marker_fill = "white",
    marker_edge_color = 'darkgreen',
    marker_edge_color2 = '#a1d99b',
    fig_size = dict(width=675, height=300),
)
chart
CPU times: user 47.1 ms, sys: 0 ns, total: 47.1 ms
Wall time: 46.4 ms
Out[19]:

Observations

  1. The slowdown in network expansion during 2021 and 2022 is also visible in the number of bikes available in the network on a yearly basis. In this chart, the number of bikes used in bike share trips across the network is shown. All bikes available might not have been used in the ridership data after it was processed in the previous step. For this reason, the number of stations shows a drop in 2022.
  2. Although the number of neighbourhoods served by the service is flat during the three years from 2020 to 2022, expansion is planned to target new neighbourhoods (not shown in the chart).

All charts shown above display totals based on processed bike share ridership data that was performed in the previous step. These totals will differ from totals published elsewhere due to differences in data processing.

Summary of Metrics Used

  1. total ridership (number of trips) across entire network
  2. total number of bikes used in ridership across entire network
  3. total number of stations used in ridership across entire network

Explore Attributes of Stations by Station Type¶

The station types are

  1. Top-Performing stations
  2. Other stations (excluding top-performers)

Get the Fraction of Stations Located in Downtown Toronto¶

  1. From the ranked stations, count number of stations by station type and location type (whether station is located downtown or not)
  2. reshape into untidy data
  3. calculate fraction of top-performing stations that are located downtown
In [20]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. count number of stations by station and location type
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'Top Performer'
                       ELSE 'Others' END
                   ) AS station_type,
                   -- append conditional to indicate if station is located in a downtown neighbourhood
                   (
                       CASE WHEN is_downtown = True
                       THEN 'Downtown'
                       ELSE 'Other_Locations' END
                   ) AS location_type,
                   -- count the number of stations that are and are not located in a downtown
                   -- neighbourhood
                   COUNT(DISTINCT(station_id)) AS num_stations
            FROM df_stations
            GROUP BY ALL
        ),
        -- 2. reshape using pivot table
        t2 AS (
            PIVOT t1
            ON location_type
            USING sum(num_stations)
        )
        -- 3. calculate fraction of stations that are and are not located in downtown neighbourhoods
        SELECT *,
               Downtown+Other_Locations AS total,
               100*Downtown/total AS frac_downtown
        FROM t2
        """
df_downtown_query = run_sql_query(query).convert_dtypes()
df_downtown_query
CPU times: user 41.5 ms, sys: 5.06 ms, total: 46.5 ms
Wall time: 28.5 ms
Out[20]:
station_type Downtown Other_Locations total frac_downtown
0 Top Performer 76 22 98 77.55102
1 Others 137 392 529 25.897921

Show above as a chart

In [21]:
%%time
chart = vzu.plot_bar_chart(
    df_downtown_query,
    xvar='frac_downtown:Q',
    yvar='station_type:N',
    xtitle='Downtown Stations (%)',
    y_axis_sort=['Top Performer', 'Others'],
    color_by_col='station_type:N',
    color_labels=['Top Performer', 'Others'],
    color_values=['darkgreen', 'lightgrey'],
    ptitle=alt.TitleParams(
        'Nearly 3X Top-Performers are Downtown as Others',
        anchor='start',
        dx=100,
        fontSize=axis_label_fontsize
    ),
    axis_label_fontsize=axis_label_fontsize,
    fig_size=dict(width=450, height=125),
)
chart.save(
    os.path.join(figures_dir, '02_top_performers_downtown.png')
)
chart
CPU times: user 76.6 ms, sys: 0 ns, total: 76.6 ms
Wall time: 48.4 ms
Out[21]:

Observations

  1. Nearly 80% of the top-performing stations are located downtown. Most bike share systems start by setting up the majority of their stations in the downtown or financial district of the city. The same is true for Bike Share Toronto. A large density of number of office buildings and the city's financial district are located in Downtown Toronto. By locating the majority of its top-peforming stations downtown as of the end of 2022, the service continues have strong support for working professionals.

Summary of Metrics Used

  1. fraction of bike share stations located in downtown Toronto

Get the Fraction of Stations Located In on Near Downtown Neighbourhoods¶

  1. From the ranked stations, count number of stations by station type and neighbourhood, for neighbourhoods close to Downtown Toronto
  2. reshape into untidy data
  3. calculate fraction of top-performing stations by neighbourhood
  4. reshape into tidy data
In [22]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. count number of stations by station and neighbourhood
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'Top Performer'
                       ELSE 'Others' END
                   ) AS station_type,
                   -- append conditional to indicate if station is located in a downtown neighbourhood
                   REPLACE(location, ' ', '_') AS location,
                   -- count the number of stations that are and are not located in a downtown
                   -- neighbourhood
                   COUNT(DISTINCT(station_id)) AS num_stations
            FROM df_stations
            GROUP BY ALL
        ),
        -- 2. reshape using pivot table
        t2 AS (
            PIVOT t1
            ON location
            USING sum(num_stations)
        ),
        -- 3. calculate fraction of stations by neighbourhood
        t3 AS (
            SELECT *,
                   Downtown+East_of_Downtown+Others+West_of_Downtown AS total,
                   100*Downtown/total AS frac_Downtown,
                   100*East_of_Downtown/total AS frac_East_of_Downtown,
                   100*West_of_Downtown/total AS frac_West_of_Downtown,
                   100*Others/total AS frac_Other_Neighbourhoods
            FROM t2
        ),
        -- 4. reshape into tidy data
        t4 AS (
            UNPIVOT (
                SELECT station_type,
                       frac_downtown,
                       frac_East_of_Downtown,
                       frac_West_of_Downtown,
                       frac_Other_Neighbourhoods
                FROM t3
            )
            ON frac_downtown,
               frac_East_of_Downtown,
               frac_West_of_Downtown,
               frac_Other_Neighbourhoods
            INTO NAME variable VALUE value
        )
        SELECT * EXCLUDE(variable),
               REPLACE(
                   REPLACE(variable, 'frac_', ''),
                   '_',
                   ' '
               ) AS variable
        FROM t4
        """
df_downtown_adjacent_query = run_sql_query(query).convert_dtypes()
df_downtown_adjacent_query
CPU times: user 42.6 ms, sys: 0 ns, total: 42.6 ms
Wall time: 26.9 ms
Out[22]:
station_type value variable
0 Others 25.897921 Downtown
1 Others 5.860113 East of Downtown
2 Others 14.555766 West of Downtown
3 Others 53.6862 Other Neighbourhoods
4 Top Performer 77.55102 Downtown
5 Top Performer 1.020408 East of Downtown
6 Top Performer 16.326531 West of Downtown
7 Top Performer 5.102041 Other Neighbourhoods

Show above as a chart

In [23]:
%%time
chart = vzu.plot_non_grouped_bar_chart_grid(
    df=df_downtown_adjacent_query,
    rows_labels_colors={
        'Top Performer': {
            'labels': [
                'Downtown',
                'West of Downtown',
                'Other Neighbourhoods',
                'East of Downtown',
            ],
            'colors': [
                'darkgreen',
                'darkred',
                'lightgrey',
                '#a1d99b',
            ],
        },
        'Others': {
            'labels': [
                'Other Neighbourhoods',
                'Downtown',
                'West of Downtown',
                'East of Downtown',
            ],
            'colors': [
                'lightgrey',
                'darkgreen',
                'darkred',
                '#a1d99b',
            ],
        }
    },
    xvar='value:Q',
    yvar='variable:N',
    color_by_col='variable:N',
    legend_label_font_color='#636363',
    title_text='Downtown and West Outperform Other Neighbourhoods',
    xtitle_text='Stations (%)',
    title_fontweight='normal',
    title_font_color='grey',
    axis_label_fontsize=14,
    ptitle_xloc=20,
    fig_size = dict(width=450, height=125),
)
chart.save(
    os.path.join(
        figures_dir, '03_top_performers_downtown_neighbourhoods.png'
    )
)
chart
CPU times: user 112 ms, sys: 12 ms, total: 124 ms
Wall time: 77.9 ms
Out[23]:

Observations

  1. As was seen in the previous chart, Downtown neighbourhoods capture up nearly 80% of all top performing stations. The next highest grouping of neighbourhoods is those adjacent to Downtown Toronto and to its West, which captures nearly 16% of the top-performers. Notably, the neighbourhoods adjacent to the east of the downtown neighbourhoods capture approximately 5% of these top-performing stations.
  2. For stations that are not top-performers, the same ordering is seen, with downtown neighbourhoods ahead of adjacent western neighbourhoods and finally adjacent eastern neighbourhoods. As expected, more than half of these non- top-performing stations are not found in Downtown Toronto. By comparison, the downtown neighbourhoods account for approximately one quarter of these stations.

Summary of Metrics Used

  1. fraction of bike share stations by neighbourhood, for neighbourhoods close to Downtown Toronto

Get the Fraction of Stations Accepting Payment via Credit Card¶

  1. From the ranked stations, count number of stations by station type and whether the station accepts credit card payments or not
  2. reshape into untidy data
  3. calculate fraction of stations that do and do not accept credit card payments when checking out a bike from a dock at the station
In [24]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. count number of stations by station type and whether they accept credit card
            -- payments
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'Top Performer'
                       ELSE 'Others' END
                   ) AS station_type,
                   -- indicate if station accepts payments using a credit card
                   (
                       CASE WHEN credit = True
                       THEN 'Credit_accepted'
                       ELSE 'Credit_not_accepted' END
                   ) AS takes_credit,
                   -- count the number of stations that do and do not accept payments using a
                   -- credit card
                   COUNT(DISTINCT(station_id)) AS num_stations
            FROM df_stations
            GROUP BY ALL
        ),
        -- 2. reshape using pivot table
        t2 AS (
            PIVOT t1
            ON takes_credit
            USING sum(num_stations)
        )
        -- 3. calculate fraction of stations that do and do not accept credit card payments
        SELECT *,
               Credit_accepted+Credit_not_accepted AS total,
               100*Credit_accepted/total AS frac_takes_credit
        FROM t2
        """
df_payment_methods_query = run_sql_query(query).convert_dtypes()
df_payment_methods_query
CPU times: user 25.2 ms, sys: 2.34 ms, total: 27.5 ms
Wall time: 17.5 ms
Out[24]:
station_type Credit_accepted Credit_not_accepted total frac_takes_credit
0 Others 419 110 529 79.206049
1 Top Performer 85 13 98 86.734694

Show above as a chart

In [25]:
%%time
chart = vzu.plot_bar_chart(
    df_payment_methods_query,
    xvar='frac_takes_credit:Q',
    yvar='station_type:N',
    xtitle='Stations (%)',
    y_axis_sort=['Top Performer', 'Others'],
    color_by_col='station_type:N',
    color_labels=['Top Performer', 'Others'],
    color_values=['darkgreen', 'lightgrey'],
    ptitle=alt.TitleParams(
        'Stations Accept Credit Card as a Method of Payment',
        anchor='start',
        dx=100,
        fontSize=axis_label_fontsize
    ),
    axis_label_fontsize=axis_label_fontsize,
    fig_size=dict(width=450, height=125),
)
chart.save(
    os.path.join(
        figures_dir, '04_top_performers_payment_methods.png'
    )
)
chart
CPU times: user 35.8 ms, sys: 0 ns, total: 35.8 ms
Wall time: 31.8 ms
Out[25]:

Observations

  1. Almost all type of stations accept credit card as a method of payment. This would be a highly conveneint method of checking out a bike for most bike share users. It is reassuring that nearly all stations support this.

Summary of Metrics Used

  1. fraction of bike share stations that accept payments using a credit card

Get the Fraction of Stations Based on Their Configuration¶

  1. From the ranked stations, count number of stations by station type and physical configuration
  2. reshape into untidy data for convenience when calculating fractions
  3. group the MAPFRAME (kiosk-less) configurations into a single category in order to get three overall categories
    • REGULAR (classic station, the most popular)
    • NO-KIOSK (no kiosk)
    • VAULT
    • ELECTRICBIKESTATION (supports e-bike charging)
  4. calculate fraction of stations based on their physical configuration (one of REGULAR, NO-KIOSK, VAULT and ELECTRICBIKESTATION)
  5. reshape into untidy data for plotting
In [26]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. count number of stations by station type and configuration
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'Top Performer'
                       ELSE 'Others' END
                   ) AS station_type,
                   -- extract station configuration (REGULAR, *MAPFRAME, ELECTRICBIKESTATION)
                   physical_configuration,
                   -- count stations
                   COUNT(DISTINCT(station_id)) AS num_stations
            FROM df_stations
            GROUP BY ALL
        ),
        -- 2. reshape into untidy data using pivot table
        t2 AS (
            PIVOT t1
            ON physical_configuration
            USING sum(num_stations)
        ),
        -- 3. create a new physical configuration as no-kiosk stations which includes
        -- all MAPFRAME configurations combined
        t3 AS (
            SELECT *,
                   -- calculate total number of stations in each category
                   (
                       COALESCE(ELECTRICBIKESTATION, NULL, 0)
                       +COALESCE(REGULAR, NULL, 0)
                       +COALESCE(REGULARLITMAPFRAME, NULL, 0)
                       +COALESCE(SMARTMAPFRAME, NULL, 0)
                       +COALESCE(SMARTLITMAPFRAME, NULL, 0)
                   ) AS total,
                   -- calculate total number of stations in the new no-kiosk (*MAPFRAME) category
                   (
                       COALESCE(REGULARLITMAPFRAME, NULL, 0)
                       +COALESCE(SMARTMAPFRAME, NULL, 0)
                       +COALESCE(SMARTLITMAPFRAME, NULL, 0)
                   ) AS No_Kiosk
            FROM t2
        ),
        -- 4. calculate fraction of stations in REGULAR, NO-KIOSK, VAULT & ELECTRICBIKESTATION configs
        t4 AS (
            SELECT station_type,
                   100*COALESCE(ELECTRICBIKESTATION, NULL, 0)/total AS Electric,
                   100*COALESCE(REGULAR, NULL, 0)/total AS Regular,
                   100*COALESCE(VAULT, NULL, 0)/total AS Vault,
                   100*COALESCE(SMARTMAPFRAME, NULL, 0)/total AS 'No Kiosk'
            FROM t3
        ),
        -- 5. reshape into tidy data using pivot table
        t5 AS (
            UNPIVOT t4
            ON Electric, Regular, Vault, 'No Kiosk'
            INTO NAME physical_configuration
            VALUE value
        )
        SELECT *
        FROM t5
        """
df_physical_config_query = run_sql_query(query).convert_dtypes()
pu.show_df(df_physical_config_query)
column station_type physical_configuration value
dtype string[python] string[python] Float64
nunique 2 4 8
missing 0 0 0
0 Top Performer Electric 1.020408
1 Top Performer Regular 85.714286
2 Top Performer Vault 0.0
3 Top Performer No Kiosk 13.265306
4 Others Electric 2.816901
5 Others Regular 81.086519
6 Others Vault 6.438632
7 Others No Kiosk 15.492958
CPU times: user 105 ms, sys: 17 ms, total: 122 ms
Wall time: 94.2 ms

Show above as a chart

In [27]:
%%time
chart = vzu.plot_grouped_bar_chart(
    df=df_physical_config_query,
    xvar="physical_configuration:N",
    yvar="sum(value):Q",
    color_by_col="physical_configuration:N",
    column_var="station_type:N",
    ytitle='Stations (%)',
    color_labels=['Regular', 'No Kiosk', 'Vault', 'Electric'],
    color_values=['darkgreen', 'lightgrey', 'lightblue', 'lightgreen'],
    axis_label_fontsize=14,
    ptitle=alt.TitleParams(
        'Top-Performing Stations Have Preference for Guidance Kiosk',
        anchor='start',
        dx=20,
        fontSize=axis_label_fontsize,
    ),
    facet_column_spacing=20,
    fig_size=dict(width=250, height=300),
)
chart.save(
    os.path.join(
        figures_dir, '05_top_performers_physical_configurations.png'
    )
)
chart
CPU times: user 55.6 ms, sys: 3.07 ms, total: 58.7 ms
Wall time: 50.9 ms
Out[27]:

Notes

  1. Cleaning of smart map frame stations (page 18/86)
  2. Smart map frame docks do not use the ticket kiosk to check out a bike (page 12/20)

Observations

  1. Top performing stations are found in a Regular configuration. These station configurations require the user to check out or check in a bike using a kiosk. Earlier, it was seen that most of the top-performers are also located downtown and that more stations were located there when the service was initially launched in order to gain the highest possible ridership. It is likely that a regular station configuration, with written instructions on the kiosk that provides guidance to the user, has kept users coming back to these stations due to both their ease-of-use and convenient location to office buildings. By comparison, the majority of MAPFRAME stations that gives the option to use the bike dock itself without a kiosk are likely not located in the areas of the heaviest usage. So, these stations can introduce more experimental features like a kiosk-less check-out process without negatively effecting the user experience (ex. due to technical difficulties) at high demand locations.

Summary of Metrics Used

  1. fraction of bike share stations within each of the following station configurations
    • regular
    • e-bike
    • kiosk-less

Get the Fraction of Overall and Weekday/Weekend Top-Performers¶

Get the fraction of overall top-performers that are also top-performing stations on weekdays and weekends

In [28]:
def get_station_type(
    df: pd.DataFrame, station_type_mapper: Dict[str, str]
) -> pd.DataFrame:
    """Get fraction of simultaneous overall, weekday & weekend top-performers."""
    df_station_types = pd.DataFrame.from_records(
        [
            {
                'station_type': label,
                "num_stations": len(df.query(query)),
                'frac_stations_overall': (
                    100*len(df.query(query))/len(df.query("is_top_perform_station"))
                ),
            }
            for query, label in station_type_mapper.items()
        ]
    )
    return df_station_types
In [29]:
station_type_criteria = {
    (
        "(is_top_perform_station_weekday == True) & "
        "(is_top_perform_station_weekend == True) & "
        "(is_top_perform_station == True)"
    ): 'Includes Weekdays & Weekends',
    (
        '((is_top_perform_station_weekday == False) | '
        '(is_top_perform_station_weekend == False)) & '
        '(is_top_perform_station == True)'
    ): 'Excludes Weekdays or Weekends',
}
df_top_perform_overall_intra_week = (
    get_station_type(df_stations, station_type_criteria)
    .convert_dtypes()
)
pu.show_df(df_top_perform_overall_intra_week)
column station_type num_stations frac_stations_overall
dtype string[python] Int64 Float64
nunique 2 2 2
missing 0 0 0
0 Includes Weekdays & Weekends 79 80.612245
1 Excludes Weekdays or Weekends 19 19.387755

Show above as a chart

In [30]:
%%time
chart = vzu.plot_pie_chart(
    df_top_perform_overall_intra_week,
    yvar = "frac_stations_overall:Q",
    color_by_col = "station_type:N",
    ptitle=alt.TitleParams(
        'Most Top-Performers Show Consistent Intra-Week Performance',
        anchor='start',
        dx=0,
        fontSize=axis_label_fontsize
    ),
    label_non_white_color = 'Includes Weekdays & Weekends',
    radius = 180,
    text_radius_offset = 230,
    axis_label_fontsize = 14,
    annotation_label_fontsize = 18,
    annotation_radius = 90,
    yvar_non_white_threshold = 25,
    stroke_thickness = 5,
    x_loc_annotation = 0,
    x_loc_label = 40,
    colors = {
        'Includes Weekdays & Weekends': 'darkgreen',
        'Excludes Weekdays or Weekends': '#c7e9c0',
    },
)
chart.save(
    os.path.join(
        figures_dir, '06_top_performers_day_of_week.png'
    )
)
chart
CPU times: user 56.2 ms, sys: 4.48 ms, total: 60.7 ms
Wall time: 58.3 ms
Out[30]:

Observations

  1. Only 20% of top-performing stations are also top-performers on one of weekends only or weekends only. The majority of overall top-performing stations are also top-performers on both weekdays only and weekends only. Without supporting information, we can't eliminate the former grouping due to its smaller size since it might capture an important base of bike share users. Bike share trips departing from and arriving at both groupings of stations will be used in subsequent analysis.

Summary of Metrics Used

  1. fraction of overall top-performing bike share stations that are also top-performers on weekdays and weekends

User Behaviour Insights¶

For all user behaviour (bike share ridership patterns) insights, the number of trips per period (month, hour, day of week) will be calculated for both station types. The two types of stations are top-performers and all other stations. By definition there are fewer top-performing stations than other types of stations. So, the raw totals for number of trips at top-performing stations will always smaller than the total trips across the other stations during the chosen period. This might incorrectly suggest that top-performers are under-performing relative to all other stations. Additionally, ridership patterns by both types of users (Annual and Casual users) might be different per period.

In order to take these factors into account, instead of using the raw totals, the metric to be compared when extracting insights from bike share ridership will be the average ridership per station during a given period. To do this, for each type of station, the total number of trips during a given period will be divided by the number of stations used during that same period. This metric will be calculated separately for each type of user. Also, the metric will be calculated per year or per year and month. It will be used to explore periodic user behaviour through bike share ridership patterns at each type of station and by both types of users.

Show Monthly Ridership By Station Type¶

Get the average monthly trips per station during 2018, per user type and station type using the following approach

  1. get monthly departures per user type and station
  2. get monthly arrivals per user type and station
  3. combine departures and arrivals (using UNION) and then reshape into untidy data so that, for each station, the total departures and arrivals appear as columns
  4. combine (LEFT JOIN) with table indicating if station is a top-performer
  5. fill any missing values in the is_top_performer column after combining in step 4.
  6. get monthly departures and arrivals per station type and user type and calculate the average number of monthly departures and arrivals per station by taking the ratio if the total monthly departures and arrivals columns to the total stations used per month
In [31]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. get number of monthly departures per station, user type and month in 2018
            SELECT start_station_id AS station_id,
                   started_at_year AS year,
                   started_at_month AS month,
                   REPLACE(user_type, ' Member', '') AS user_type,
                   COUNT(DISTINCT(trip_id)) AS trips,
                   'departures' AS type
            FROM read_parquet({fpaths_proc[2018]})
            GROUP BY all
        ),
        t2 AS (
            -- 2. get number of monthly arrivals per station, user type and month in 2018
            SELECT end_station_id AS station_id,
                   ended_at_year AS year,
                   ended_at_month AS month,
                   REPLACE(user_type, ' Member', '') AS user_type,
                   COUNT(DISTINCT(trip_id)) AS trips,
                   'arrivals' AS type
            FROM read_parquet({fpaths_proc[2018]})
            WHERE ended_at_year = 2018
            GROUP BY all
        ),
        -- 3. combine monthly departures and arrivals per station
        -- first use UNION to capture stations with only departures or only arrivals
        -- then reshape into untidy data to get departures & arrivals per station as columns
        t3 AS (
            PIVOT (
                SELECT *
                FROM t1
                UNION
                SELECT *
                FROM t2
            )
            ON type USING MAX(trips)
        ),
        -- 4. LEFT JOIN with table indicating if station is a top-performer
        t4 AS (
            SELECT *
            FROM t3
            LEFT JOIN (
                SELECT station_id,
                       is_top_perform_station
                FROM df_stations
            ) USING (station_id)
        ),
        -- 5. fill any missing values in the is_top_perform_station column after above
        -- LEFT JOIN with FALSE
        t5 AS (
            SELECT --fill missing values in is_top_perform_station with FALSE
                   * EXCLUDE(is_top_perform_station, departures, arrivals),
                   COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station,
                   COALESCE(departures, NULL, 0) AS departures,
                   COALESCE(arrivals, NULL, 0) AS arrivals
            FROM t4
        ),
        -- 6. calculate average monthly trips per station, station type and user type
        t6 AS (
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'top-performer'
                       ELSE 'others' END
                   ) AS station_type,
                   user_type,
                   year,
                   month,
                   -- number of stations used
                   COUNT(DISTINCT(station_id)) AS num_stations_used,
                   -- strftime(make_date(year, month, 1), '%Y-%m') AS date_ym,
                   make_date(year, month, 1) AS date_ym,
                   -- get departures
                   SUM(departures) AS departures,
                   -- get arrivals
                   SUM(arrivals) AS arrivals,
                   -- get average departures
                   SUM(departures)/num_stations_used AS avg_departures_per_station,
                   -- get average arrivals
                   SUM(arrivals)/num_stations_used AS avg_arrivals_per_station,
                   strftime(make_date(2018, 7, 1), '%Y-%m') AS peak_ridership
            FROM t5
            GROUP BY ALL
            ORDER BY ALL
        )
        SELECT *
        FROM t6
        """
df_by_month_2018 = run_sql_query(query).convert_dtypes()
pu.show_df(df_by_month_2018)
column station_type user_type year month num_stations_used date_ym departures arrivals avg_departures_per_station avg_arrivals_per_station peak_ridership
dtype string[python] string[python] Int32 Int32 Int64 datetime64[us] Int64 Int64 Float64 Float64 string[python]
nunique 2 2 1 12 20 12 48 48 48 48 1
missing 0 0 0 0 0 0 0 0 0 0 0
0 others Annual 2018 1 192 2018-01-01 22522 22221 117.302083 115.734375 2018-07
1 others Annual 2018 2 193 2018-02-01 25536 25219 132.310881 130.668394 2018-07
2 others Annual 2018 3 192 2018-03-01 42383 41675 220.744792 217.057292 2018-07
3 others Annual 2018 4 192 2018-04-01 44578 44226 232.177083 230.34375 2018-07
4 others Annual 2018 5 193 2018-05-01 88022 86699 456.072539 449.217617 2018-07
5 others Annual 2018 6 210 2018-06-01 101149 100255 481.661905 477.404762 2018-07
6 others Annual 2018 7 245 2018-07-01 114157 113672 465.946939 463.967347 2018-07
7 others Annual 2018 8 265 2018-08-01 112639 112759 425.05283 425.50566 2018-07
8 others Annual 2018 9 270 2018-09-01 114040 114485 422.37037 424.018519 2018-07
9 others Annual 2018 10 270 2018-10-01 89755 89136 332.425926 330.133333 2018-07
10 others Annual 2018 11 270 2018-11-01 56093 55611 207.751852 205.966667 2018-07
11 others Annual 2018 12 270 2018-12-01 45113 44430 167.085185 164.555556 2018-07
12 others Casual 2018 1 158 2018-01-01 582 568 3.683544 3.594937 2018-07
13 others Casual 2018 2 172 2018-02-01 914 939 5.313953 5.459302 2018-07
14 others Casual 2018 3 188 2018-03-01 2593 2539 13.792553 13.505319 2018-07
15 others Casual 2018 4 191 2018-04-01 4650 4561 24.34555 23.879581 2018-07
16 others Casual 2018 5 192 2018-05-01 18371 18236 95.682292 94.979167 2018-07
17 others Casual 2018 6 210 2018-06-01 24843 24442 118.3 116.390476 2018-07
18 others Casual 2018 7 245 2018-07-01 28759 28411 117.383673 115.963265 2018-07
19 others Casual 2018 8 264 2018-08-01 30440 29888 115.30303 113.212121 2018-07
20 others Casual 2018 9 269 2018-09-01 20422 20400 75.918216 75.836431 2018-07
21 others Casual 2018 10 267 2018-10-01 6624 6623 24.808989 24.805243 2018-07
22 others Casual 2018 11 251 2018-11-01 1630 1693 6.494024 6.74502 2018-07
23 others Casual 2018 12 237 2018-12-01 1173 1172 4.949367 4.945148 2018-07
24 top-performer Annual 2018 1 77 2018-01-01 19651 19949 255.207792 259.077922 2018-07
25 top-performer Annual 2018 2 77 2018-02-01 21394 21710 277.844156 281.948052 2018-07
26 top-performer Annual 2018 3 77 2018-03-01 35572 36280 461.974026 471.168831 2018-07
27 top-performer Annual 2018 4 77 2018-04-01 36763 37111 477.441558 481.961039 2018-07
28 top-performer Annual 2018 5 77 2018-05-01 70712 72026 918.337662 935.402597 2018-07
29 top-performer Annual 2018 6 84 2018-06-01 82588 83481 983.190476 993.821429 2018-07
30 top-performer Annual 2018 7 85 2018-07-01 97056 97546 1141.835294 1147.6 2018-07
31 top-performer Annual 2018 8 85 2018-08-01 93024 92899 1094.4 1092.929412 2018-07
32 top-performer Annual 2018 9 85 2018-09-01 89627 89196 1054.435294 1049.364706 2018-07
33 top-performer Annual 2018 10 87 2018-10-01 68707 69320 789.735632 796.781609 2018-07
34 top-performer Annual 2018 11 88 2018-11-01 43753 44236 497.193182 502.681818 2018-07
35 top-performer Annual 2018 12 88 2018-12-01 34849 35539 396.011364 403.852273 2018-07
36 top-performer Casual 2018 1 77 2018-01-01 588 602 7.636364 7.818182 2018-07
37 top-performer Casual 2018 2 77 2018-02-01 1035 1010 13.441558 13.116883 2018-07
38 top-performer Casual 2018 3 77 2018-03-01 2557 2611 33.207792 33.909091 2018-07
39 top-performer Casual 2018 4 77 2018-04-01 4734 4820 61.480519 62.597403 2018-07
40 top-performer Casual 2018 5 77 2018-05-01 19448 19577 252.571429 254.246753 2018-07
41 top-performer Casual 2018 6 84 2018-06-01 26040 26429 310.0 314.630952 2018-07
42 top-performer Casual 2018 7 85 2018-07-01 28515 28874 335.470588 339.694118 2018-07
43 top-performer Casual 2018 8 85 2018-08-01 28435 28981 334.529412 340.952941 2018-07
44 top-performer Casual 2018 9 85 2018-09-01 19031 19069 223.894118 224.341176 2018-07
45 top-performer Casual 2018 10 87 2018-10-01 6773 6774 77.850575 77.862069 2018-07
46 top-performer Casual 2018 11 87 2018-11-01 1552 1488 17.83908 17.103448 2018-07
47 top-performer Casual 2018 12 87 2018-12-01 1181 1183 13.574713 13.597701 2018-07
CPU times: user 3.37 s, sys: 92.4 ms, total: 3.46 s
Wall time: 567 ms

Notes

  1. A similar approach is used to get the average trips using other temporal aggregations (month of year, hour of day, weekday or weekend).

Show above as a chart

In [32]:
%%time
for yvar, title_text in zip(
    ['avg_departures_per_station:Q', 'avg_arrivals_per_station:Q'],
    ['Departures', 'Arrivals'],
):
    chart = vzu.plot_grouped_line_charts(
        df_by_month_2018,
        pd.DataFrame({'date_ym': [pd.to_datetime('2018-07-01')]}),
        xvar="yearmonth(date_ym):T",
        yvar=yvar,
        color_by_col="user_type:N",
        color_labels=['Annual', 'Casual'],
        color_values=['darkgreen', '#a1d99b'],
        legend=alt.Legend(
            titleFontSize=axis_label_fontsize,
            labelFontSize=axis_label_fontsize,
            labelColor='black',
            orient='bottom',
            direction='horizontal',
            titleOrient='left',
        ),
        annotation_text='Middle of year',
        annotation_text_color='red',  # 'grey'
        annotation_text_opacity=0.5,
        annotation_text_x_loc=-60,
        annotation_text_y_loc=-135,
        xvar_rule="yearmonth(peak_ridership):T",
        color_rule='',  # '#cccccc'
        title_text=f'Peak 2018 {title_text} Occurs During Warmest Months',
        sub_title_text_color='#7f7f7f',
        sub_title_vertical_padding=5,
        axis_label_fontsize=14,
        x_axis_ticks=None,
        concat='column',
        fig_size=dict(width=450, height=300),
    )
    display(chart)
CPU times: user 147 ms, sys: 11.9 ms, total: 158 ms
Wall time: 158 ms

Observations

  1. Peak ridership is observed during the warmer months of the year, for both types of stations and for both types of users (Casual and Annual).
  2. The monthly ridership patterns are similar for departures and arrivals since the network has higher usage during the warmer months. Trips departing from a location in the network must arrive at another station, so if departures follow a monthly pattern then it is not surprising that arrivals follow a similar pattern. Due to this strong similarity between trends for departures and arrivals, for further exploration only departures will be considered and this will be taken as trips.
  3. Annual ridership was dominant during 2018, more than double the Casual ridership. This is a patten seen at many bike share systems due to the proximity of a large number of stations to office buildings in downtown locations with a city.

Get the average monthly trips per station for each year and month in all available data per user type and station type, separately for weekdays and weekends

In [33]:
%%time
dfs_by_month = {}
for day_type, dow in zip(['weekday', 'weekend'], [[0, 1, 2, 3, 4], [5, 6]]):
    dow_str = '('+', '.join([str(d) for d in dow])+')'
    query = f"""
            WITH t1 AS (
                -- 1. get number of monthly trips per station, user type, year & month in all data
                SELECT start_station_id AS station_id,
                       started_at_year AS year,
                       started_at_month AS month,
                       REPLACE(user_type, ' Member', '') AS user_type,
                       COUNT(DISTINCT(trip_id)) AS trips
                FROM read_parquet({fpaths_proc_all})
                WHERE ISODOW(started_at)-1 IN {dow_str}
                GROUP BY all
            ),
            -- 2. LEFT JOIN with table indicating if station is a top-performer
            t2 AS (
                SELECT *
                FROM t1
                LEFT JOIN (
                    SELECT station_id,
                           is_top_perform_station
                    FROM df_stations
                ) USING (station_id)
            ),
            -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
            -- with FALSE
            t3 AS (
                SELECT --fill missing values in is_top_perform_station with FALSE
                       * EXCLUDE(is_top_perform_station),
                       COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
                FROM t2
            ),
            -- 4. calculate average monthly trips per station, station type, user type, year and month
            t4 AS (
                SELECT -- change the top-performing station column from a boolean to a string
                       (
                           CASE WHEN is_top_perform_station = True
                           THEN 'top-performer'
                           ELSE 'others' END
                       ) AS station_type,
                       user_type,
                       year,
                       month,
                       -- get number of stations
                       COUNT(DISTINCT(station_id)) AS num_stations_used,
                       strftime(make_date(year, month, 1), '%Y-%m') AS date_ym,
                       -- get trips (ridership)
                       SUM(trips) AS trips,
                       -- get average ridership
                       SUM(trips)/num_stations_used AS avg_trips_per_station,
                       strftime(make_date(2021, 10, 1), '%Y-%m') AS corporate_start
                FROM t3
                GROUP BY ALL
                ORDER BY ALL
            )
            SELECT *
            FROM t4
            """
    dfs_by_month[day_type] = run_sql_query(query).convert_dtypes()
pu.show_df(dfs_by_month['weekday'])
column station_type user_type year month num_stations_used date_ym trips avg_trips_per_station corporate_start
dtype string[python] string[python] Int32 Int32 Int64 string[python] Int64 Float64 string[python]
nunique 2 2 6 12 98 63 252 252 1
missing 0 0 0 0 0 0 0 0 0
0 others Annual 2018 1 191 2018-01 18577 97.26178 2021-10
1 others Annual 2018 2 192 2018-02 21486 111.90625 2021-10
2 others Annual 2018 3 192 2018-03 34113 177.671875 2021-10
3 others Annual 2018 4 192 2018-04 36652 190.895833 2021-10
4 others Annual 2018 5 192 2018-05 73430 382.447917 2021-10
... ... ... ... ... ... ... ... ... ...
247 top-performer Casual 2022 11 96 2022-11 59875 623.697917 2021-10
248 top-performer Casual 2022 12 96 2022-12 37288 388.416667 2021-10
249 top-performer Casual 2023 1 96 2023-01 38847 404.65625 2021-10
250 top-performer Casual 2023 2 96 2023-02 38214 398.0625 2021-10
251 top-performer Casual 2023 3 96 2023-03 55798 581.229167 2021-10

252 rows × 9 columns

CPU times: user 16.6 s, sys: 538 ms, total: 17.2 s
Wall time: 1.75 s

Notes

  1. It may be possible to replace t2 and t3 by a single CTE that both performs the LEFT JOIN and then fills missing values as shown below, while also respecting SQL order of operations

    -- LEFT JOIN with table indicating if station is a top-performer and fill missing values
    t3 AS (
        SELECT * EXCLUDE(is_top_perform_station),
               COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
        FROM t1
        LEFT JOIN (
            SELECT station_id,
                   is_top_perform_station
            FROM df_stations
        ) USING (station_id)
    ),
    

    This is more condensed than the version used above. In future work, it should be verified that this approach can be used.

Show above as a chart

In [34]:
%%time
for day_type, fname_no_ext in zip(
    ['weekday', 'weekend'],
    ['07_monthly_weekday_ridership', '08_monthly_weekend_ridership'],
):
    chart = vzu.plot_grouped_line_charts(
        dfs_by_month[day_type],
        pd.DataFrame({'date_ym': [pd.to_datetime('2022-10-01')]}),
        xvar="yearmonth(date_ym):T",
        yvar="avg_trips_per_station:Q",
        color_by_col="user_type:N",
        color_labels=['Annual', 'Casual'],
        color_values=['darkgreen', '#a1d99b'],
        legend=alt.Legend(
            titleFontSize=axis_label_fontsize,
            labelFontSize=axis_label_fontsize,
            labelColor='black',
            orient='bottom',
            direction='horizontal',
            titleOrient='left',
        ),
        annotation_text='Corporate Plan Starts' if day_type == 'weekday' else '',
        annotation_text_color='red',  # 'grey'
        annotation_text_opacity=0.5,
        annotation_text_x_loc=-175,
        annotation_text_y_loc=-95,
        xvar_rule="yearmonth(corporate_start):T",
        color_rule='red',  # '#cccccc'
        title_text=(
            'Inversion in Casual & Annual Patterns since Oct. 2021'
            if day_type == 'weekday'
            else 'Weekend Casual Ridership is Dominant since Oct. 2021'
        ),
        sub_title_text_color='#7f7f7f',
        sub_title_vertical_padding=5,
        axis_label_fontsize=14,
        x_axis_ticks=None,
        concat='column',
        fig_size=dict(width=450, height=300),
    )
    chart.save(os.path.join(figures_dir, f'{fname_no_ext}.png'))
    display(chart)
CPU times: user 557 ms, sys: 31.3 ms, total: 589 ms
Wall time: 499 ms

Observations

  1. A corporate bikeshare membership plan was offered in late September of 2021. Since then, average monthly Casual ridership has accounted for the majority of overall bike share ridership for several reasons
    • switch to hybrid working hours
    • use of bike share as an outdoor alternative to public transit to practice social-distancing
    • increased environmental awareness
    • improved cost flexibility (the corporate plan might partly account for this)
    • increased bicycle infrastructure (eg. bike lanes) introduced by the city of Toronto
      • 2022
      • 2023 (1, 2, 3, 4)
  2. At both types of stations, Casual ridership shows a year-over-year increase since the corporate plan was offered. By comparison, Annual monthly ridership is nearly unchanged at top performing stations. The same is true on weekdays and weekends. However, at other (non- top-performing) stations, Annual ridership has shown a downward trend down on weekdays between 2018 and 2022.
  3. Both types of members' ridership peak during the warmer months (middle) of the year at both types of stations on weekdays and weekends. This is the prime bike share season in Toronto. A wider peak is observed for Casual than for Annual ridership at both types of stations, suggesting that seasonal growth in Casual ridership starts earlier in the year and ends later in the year.
  4. On weekends, average Casual and Annual monthly ridership was nearly equal and relatively unchanged from 2019 to 2021. Annual ridership outpeformed Casual ridership on weekdays from 2018 to 2021. These were the distinguishing factors between Annual and Casual ridership during the four years between 2018 and 2021 inclusive. However, for weekdays and weekends in 2022, Casual monthly ridership has outperformed Annual ridership for every month since April. This is true for both types of stations.

Summary of Metrics Used

  1. average number of monthly departures (trips) per station, user type and station type for each year and month

User Behaviour Insights - Show Hourly Ridership By Station Type¶

Get the average hourly trips per station during 2018, per user type and station type using the following approach

  1. get hourly departures (trips) per station and user type during 2018
  2. combine (LEFT JOIN) with table indicating if table is a top-performer
  3. fill any missing values in the is_top_performer column after combining in step 4.
  4. get hourly departures (trips) per station type and user type and calculate average number of hourly trips per station by taking the ratio of the total trips to the total stations used per hour during 2018
In [35]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. get number of hourly trips per station and user type in 2018
            SELECT start_station_id AS station_id,
                   started_at_year AS year,
                   started_at_hour AS hour,
                   REPLACE(user_type, ' Member', '') AS user_type,
                   COUNT(DISTINCT(trip_id)) AS trips
            FROM read_parquet({fpaths_proc[2018]})
            GROUP BY all
        ),
        -- 2. LEFT JOIN with table indicating if station is a top-performer
        t2 AS (
            SELECT *
            FROM t1
            LEFT JOIN (
                SELECT station_id,
                       is_top_perform_station
                FROM df_stations
            ) USING (station_id)
        ),
        -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
        -- with FALSE
        t3 AS (
            SELECT --fill missing values in is_top_perform_station with FALSE
                   * EXCLUDE(is_top_perform_station),
                   COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
            FROM t2
        ),
        -- 4. calculate average hourly trips per station, station type and user type
        t4 AS (
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'top-performer'
                       ELSE 'others' END
                   ) AS station_type,
                   user_type,
                   year,
                   hour,
                   -- get number of stations
                   COUNT(DISTINCT(station_id)) AS num_stations_used,
                   strftime(make_timestamp(year, 1, 1, hour, 0, 0), '%Y %H') AS date_yh,
                   -- get trips (ridership)
                   SUM(trips) AS trips,
                   -- get average ridership
                   SUM(trips)/num_stations_used AS avg_trips_per_station,
                   '2018 08' AS peak_ridership
            FROM t3
            GROUP BY ALL
            ORDER BY ALL
        )
        SELECT *
        FROM t4
        """
df_by_hour_2018 = run_sql_query(query).convert_dtypes()
pu.show_df(df_by_hour_2018)
column station_type user_type year hour num_stations_used date_yh trips avg_trips_per_station peak_ridership
dtype string[python] string[python] Int32 Int32 Int64 string[python] Int64 Float64 string[python]
nunique 2 2 1 24 40 24 96 96 1
missing 0 0 0 0 0 0 0 0 0
0 others Annual 2018 0 253 2018 00 7813 30.881423 2018 08
1 others Annual 2018 1 245 2018 01 4788 19.542857 2018 08
2 others Annual 2018 2 238 2018 02 3294 13.840336 2018 08
3 others Annual 2018 3 217 2018 03 1486 6.847926 2018 08
4 others Annual 2018 4 198 2018 04 1449 7.318182 2018 08
... ... ... ... ... ... ... ... ... ...
91 top-performer Casual 2018 19 88 2018 19 10660 121.136364 2018 08
92 top-performer Casual 2018 20 87 2018 20 8190 94.137931 2018 08
93 top-performer Casual 2018 21 86 2018 21 6269 72.895349 2018 08
94 top-performer Casual 2018 22 86 2018 22 4612 53.627907 2018 08
95 top-performer Casual 2018 23 85 2018 23 3425 40.294118 2018 08

96 rows × 9 columns

CPU times: user 1.35 s, sys: 24.6 ms, total: 1.38 s
Wall time: 286 ms

Show above as a chart

In [36]:
%%time
chart = vzu.plot_grouped_line_charts(
    df_by_hour_2018,
    pd.DataFrame({'date_yh': ['2018 08']}),
    xvar="date_yh:N",
    yvar="avg_trips_per_station:Q",
    color_by_col="user_type:N",
    color_labels=['Annual', 'Casual'],
    color_values=['darkgreen', '#a1d99b'],
    legend=alt.Legend(
        titleFontSize=axis_label_fontsize,
        labelFontSize=axis_label_fontsize,
        labelColor='black',
        orient='bottom',
        direction='horizontal',
        titleOrient='left',
    ),
    annotation_text='Annual Peaks at 8AM, 5PM',
    annotation_text_color='red',  # 'grey'
    annotation_text_opacity=0.5,
    annotation_text_x_loc=60,
    annotation_text_y_loc=-85,
    xvar_rule="",
    color_rule='red',  # '#cccccc'
    title_text='Annual 2018 Trips Show Commuting-Driven Usage',
    sub_title_text_color='#7f7f7f',
    sub_title_vertical_padding=5,
    axis_label_fontsize=14,
    x_axis_ticks=df_by_hour_2018.iloc[::2, :]['date_yh'].tolist(),
    concat='column',
    fig_size=dict(width=450, height=300),
)
chart
CPU times: user 37.9 ms, sys: 0 ns, total: 37.9 ms
Wall time: 37.5 ms
Out[36]:

Observations

  1. Certain characteristics of average hourly ridership per station by Annual and Casual members are visible. Annual ridership shows stronger evidence of a commuter-driven pattern - the service is used more for commuting to school or work by Annual members than for leisurely activities by Casual members. This results in peaks in ridership during the morning and evening rush hours. The evening rush hour peak is stronger and wider (covers more hours) than the morning rush hour peak. There is also a much weaker peak during the middle of the day, around the lunch hour.
  2. Casual member ridership shows weak or no evidence of a morning peak but has a gradual increase to peak ridership in the evening. Peak ridership for Casual members is weaker by a factor of nearly 10. This suggests casual ridership has a leisurely pattern unlike the commuter-driven pattern of Annual ridership.
  3. The same commuter-driven and leisurely patterns are observed at both types of stations (top-performers and others).
  4. The trends seen here are typical of bike share ridership trends seen at other bike share networks as well.

Get the average hourly trips per station for each year in all available data per user type and station type, separately for weekdays and weekends

In [37]:
%%time
dfs_by_hour = {}
for day_type, dow in zip(['weekday', 'weekend'], [[0, 1, 2, 3, 4], [5, 6]]):
    dow_str = '('+', '.join([str(d) for d in dow])+')'
    query = f"""
            WITH t1 AS (
                -- 1. get number of hourly trips per station, user type and year in all data
                SELECT start_station_id AS station_id,
                       started_at_year AS year,
                       started_at_hour AS hour,
                       REPLACE(user_type, ' Member', '') AS user_type,
                       COUNT(DISTINCT(trip_id)) AS trips
                FROM read_parquet({fpaths_proc_all})
                WHERE ISODOW(started_at)-1 IN {dow_str}
                GROUP BY all
            ),
            -- 2. LEFT JOIN with table indicating if station is a top-performer
            t2 AS (
                SELECT *
                FROM t1
                LEFT JOIN (
                    SELECT station_id,
                           is_top_perform_station
                    FROM df_stations
                ) USING (station_id)
            ),
            -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
            -- with FALSE
            t3 AS (
                SELECT --fill missing values in is_top_perform_station with FALSE
                       * EXCLUDE(is_top_perform_station),
                       COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
                FROM t2
            ),
            -- 4. calculate average monthly trips per station, station type, user type and year
            t4 AS (
                SELECT -- change the top-performing station column from a boolean to a string
                       (
                           CASE WHEN is_top_perform_station = True
                           THEN 'top-performer'
                           ELSE 'others' END
                       ) AS station_type,
                       user_type,
                       year,
                       hour,
                       -- get number of stations
                       COUNT(DISTINCT(station_id)) AS num_stations_used,
                       strftime(make_timestamp(year, 1, 1, hour, 0, 0), '%Y %H') AS date_yh,
                       -- get trips (ridership)
                       SUM(trips) AS trips,
                       -- get average ridership
                       SUM(trips)/num_stations_used AS avg_trips_per_station,
                       '2021 00' AS corporate_start
                FROM t3
                GROUP BY ALL
                ORDER BY ALL
            )
            SELECT *
            FROM t4
            """
    dfs_by_hour[day_type] = run_sql_query(query).convert_dtypes()
pu.show_df(dfs_by_hour['weekday'])
column station_type user_type year hour num_stations_used date_yh trips avg_trips_per_station corporate_start
dtype string[python] string[python] Int32 Int32 Int64 string[python] Int64 Float64 string[python]
nunique 2 2 6 24 214 144 569 576 1
missing 0 0 0 0 0 0 0 0 0
0 others Annual 2018 0 242 2018 00 4451 18.392562 2021 00
1 others Annual 2018 1 214 2018 01 2205 10.303738 2021 00
2 others Annual 2018 2 199 2018 02 1398 7.025126 2021 00
3 others Annual 2018 3 165 2018 03 632 3.830303 2021 00
4 others Annual 2018 4 167 2018 04 1037 6.209581 2021 00
... ... ... ... ... ... ... ... ... ...
571 top-performer Casual 2023 19 96 2023 19 9601 100.010417 2021 00
572 top-performer Casual 2023 20 96 2023 20 7357 76.635417 2021 00
573 top-performer Casual 2023 21 96 2023 21 5729 59.677083 2021 00
574 top-performer Casual 2023 22 96 2023 22 4101 42.71875 2021 00
575 top-performer Casual 2023 23 94 2023 23 2706 28.787234 2021 00

576 rows × 9 columns

CPU times: user 18.4 s, sys: 648 ms, total: 19.1 s
Wall time: 1.97 s

Show above as a chart

In [38]:
%%time
for day_type, fname_no_ext in zip(
    ['weekday', 'weekend'],
    ['09_hourly_weekday_ridership', '10_hourly_weekend_ridership'],
):
    chart = vzu.plot_grouped_line_charts(
        dfs_by_hour[day_type],
        pd.DataFrame({'date_yh': ['2021 00']}),
        xvar="date_yh:N",
        yvar="avg_trips_per_station:Q",
        color_by_col="user_type:N",
        color_labels=['Annual', 'Casual'],
        color_values=['darkgreen', '#a1d99b'],
        legend=alt.Legend(
            titleFontSize=axis_label_fontsize,
            labelFontSize=axis_label_fontsize,
            labelColor='black',
            orient='bottom',
            direction='horizontal',
            titleOrient='left',
        ),
        annotation_text='Corporate Plan Starts' if day_type == 'weekday' else '',
        annotation_text_color='red',  # 'grey'
        annotation_text_opacity=0.5,
        annotation_text_x_loc=-80,
        annotation_text_y_loc=-115,
        xvar_rule="corporate_start:N",
        color_rule='red',  # '#cccccc'
        title_text=(
            'Since Oct. 2021, Casual Hourly Usage Dominates Annual'
            if day_type == 'weekday'
            else 'Weekend Casual Hourly Usage is 2X Annual since Oct. 2021'
        ),
        sub_title_text_color='#7f7f7f',
        sub_title_vertical_padding=5,
        axis_label_fontsize=14,
        x_axis_ticks=dfs_by_hour[day_type].iloc[::16, :]['date_yh'].tolist(),
        concat='column',
        fig_size=dict(width=450, height=300),
    )
    chart.save(os.path.join(figures_dir, f'{fname_no_ext}.png'))
    display(chart)
CPU times: user 800 ms, sys: 8.3 ms, total: 808 ms
Wall time: 612 ms

Observations

  1. Weekdays
    • For obvious reasons, the strong morning peak (centered at 8AM) in average hourly ridership per station by Annual members was much weaker as of 2020. The evening peak (centered at 4PM) remained. However, Casual ridership in 2022 is now showing the presence of a peak in morning and lunch hour ridership and a stronger and sharper peak in evening ridership than the peak seen in Annual ridership. This morning peak in Casual ridership was always present but weaker than it currently is. However, it first showed signs of growth in 2021. At both types of stations, Casual ridership now matches or exceeds average hourly ridership per station by Annual members. Through the emergence of this morning peak, Casual ridership has evolved to be a mix (hybrid) of previously observed Annual and Casual usage patterns. This is evidence of the emergence of hybrid workers who don't follow commuter-driven usage patterns that were previously seen in Annual member ridership alone.
  2. Weekends
    • A single wider and more gradual peak in Annual and Casual ridership is visible and is centered at 4PM. There is no peak in morning ridership at both types of stations.
    • For both types of users, average hourly weekend ridership was relatively unchanged from 2019 to 2021, in particular at the top-performing stations. However, in 2022, peak Casual hourly ridership has more than doubled Annual ridership. These findings reflect the impact of increased usage by Casual bike share members on hourly bike share ridership patterns in Toronto. Leisurely bike share ridership determines the pattern seen in weekend ridership on an hourly basis and Casual ridership is now dominant among overall bike share ridership on weekends at both types of stations.

Get the average hourly trips per station for each year and month in all available data for each user type and station type

In [39]:
%%time
dfs_by_month_hour = {}
for day_type, dow in zip(['weekday', 'weekend'], [[0, 1, 2, 3, 4], [5, 6]]):
    dow_str = '('+', '.join([str(d) for d in dow])+')'
    query = f"""
            WITH t1 AS (
                -- 1. get number of hourly trips per station, user type, year & month in all data
                SELECT start_station_id AS station_id,
                       started_at_year AS year,
                       started_at_month AS month,
                       started_at_hour AS hour,
                       REPLACE(user_type, ' Member', '') AS user_type,
                       COUNT(DISTINCT(trip_id)) AS trips
                FROM read_parquet({fpaths_proc[2020]+fpaths_proc[2021]+fpaths_proc[2022]})
                WHERE ISODOW(started_at)-1 IN {dow_str}
                GROUP BY all
            ),
            -- 2. LEFT JOIN with table indicating if station is a top-performer
            t2 AS (
                SELECT *
                FROM t1
                LEFT JOIN (
                    SELECT station_id,
                           is_top_perform_station
                    FROM df_stations
                ) USING (station_id)
            ),
            -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
            -- with FALSE
            t3 AS (
                SELECT --fill missing values in is_top_perform_station with FALSE
                       * EXCLUDE(is_top_perform_station),
                       COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
                FROM t2
            ),
            -- 4. calculate average hourly trips per station, station type, user type, year and month
            t4 AS (
                SELECT -- change the top-performing station column from a boolean to a string
                       (
                           CASE WHEN is_top_perform_station = True
                           THEN 'top-performer'
                           ELSE 'others' END
                       ) AS station_type,
                       user_type,
                       year,
                       month,
                       hour,
                       -- get number of stations
                       COUNT(DISTINCT(station_id)) AS num_stations_used,
                       strftime(make_timestamp(year, month, 1, hour, 0, 0), '%Y-%m %H') AS date_yh,
                       -- get trips (ridership)
                       SUM(trips) AS trips,
                       -- get average ridership
                       SUM(trips)/num_stations_used AS avg_trips_per_station,
                       '2021-10 00' AS corporate_start
                FROM t3
                GROUP BY ALL
                ORDER BY ALL
            )
            SELECT *
            FROM t4
            """
    dfs_by_month_hour[day_type] = run_sql_query(query).convert_dtypes()
pu.show_df(dfs_by_month_hour['weekday'])
column station_type user_type year month hour num_stations_used date_yh trips avg_trips_per_station corporate_start
dtype string[python] string[python] Int32 Int32 Int32 Int64 string[python] Int64 Float64 string[python]
nunique 2 2 3 12 24 498 864 2266 3177 1
missing 0 0 0 0 0 0 0 0 0 0
0 others Annual 2020 1 0 121 2020-01 00 229 1.892562 2021-10 00
1 others Annual 2020 1 1 91 2020-01 01 166 1.824176 2021-10 00
2 others Annual 2020 1 2 66 2020-01 02 103 1.560606 2021-10 00
3 others Annual 2020 1 3 36 2020-01 03 57 1.583333 2021-10 00
4 others Annual 2020 1 4 44 2020-01 04 102 2.318182 2021-10 00
... ... ... ... ... ... ... ... ... ... ...
3447 top-performer Casual 2022 12 19 94 2022-12 19 2418 25.723404 2021-10 00
3448 top-performer Casual 2022 12 20 94 2022-12 20 1820 19.361702 2021-10 00
3449 top-performer Casual 2022 12 21 94 2022-12 21 1437 15.287234 2021-10 00
3450 top-performer Casual 2022 12 22 94 2022-12 22 1124 11.957447 2021-10 00
3451 top-performer Casual 2022 12 23 86 2022-12 23 818 9.511628 2021-10 00

3452 rows × 10 columns

CPU times: user 14.8 s, sys: 379 ms, total: 15.2 s
Wall time: 1.67 s

Show above as a chart

In [40]:
%%time
for day_type in ['weekday', 'weekend']:
    chart = vzu.plot_grouped_line_charts(
        dfs_by_month_hour[day_type],
        pd.DataFrame({'date_yh': ['2021-10 00']}),
        xvar="date_yh:N",
        yvar="avg_trips_per_station:Q",
        color_by_col="user_type:N",
        color_labels=['Annual', 'Casual'],
        color_values=['darkgreen', '#a1d99b'],
        legend=alt.Legend(
            titleFontSize=axis_label_fontsize,
            labelFontSize=axis_label_fontsize,
            labelColor='black',
            orient='bottom',
            direction='horizontal',
            titleOrient='left',
        ),
        annotation_text='Corporate Plan Starts' if day_type == 'weekday' else '',
        annotation_text_color='red',  # 'grey'
        annotation_text_opacity=0.5,
        annotation_text_x_loc=-80,
        annotation_text_y_loc=-115,
        xvar_rule="corporate_start:N",
        color_rule='red',  # '#cccccc'
        title_text=(
            'Since Oct. 2021, Casual Hourly Usage Dominates Annual'
            if day_type == 'weekday'
            else 'Weekend Casual Hourly Usage is 2X Annual since Oct. 2021'
        ),
        sub_title_text_color='#7f7f7f',
        sub_title_vertical_padding=5,
        axis_label_fontsize=14,
        x_axis_ticks=(
            dfs_by_month_hour[day_type]
            .query("(station_type == 'others') & (user_type == 'Annual')")
            # show tick labels for month-hour combinations every three months
            .iloc[::24*3]
            ['date_yh']
            .tolist()
        ),
        concat='row',
        fig_size=dict(width=1_400, height=300),
    )
    display(chart)
CPU times: user 229 ms, sys: 12 ms, total: 241 ms
Wall time: 240 ms

Observations

  1. These plots more clearly shows the emergence of the stronger morning peak in average hourly ridership by Casual members immediately after the corporate plan was offered during October and November of 2021. Annual ridership was dominant again during December 2021 to March or April of 2022, before Casual ridership again dominated average hourly ridership per station across the network.
  2. The same patterns are observed at top-performing and other stations.

Summary of Metrics Used

  1. average number of hourly departures (trips) per station, user type and station type for each
    • year and month
    • year

Show Ridership By Day of the Week and Station Type¶

Get the average trips per station by day of the week for each year in all available data per user type and station type using the following approach

  1. get weekday departures (trips) per station, user type and year in all available data (2018 to 2023)
  2. combine (LEFT JOIN) with table indicating if table is a top-performer
  3. fill any missing values in the is_top_performer column after combining in step 2.
  4. get monthly departures (trips) per station type and user type and calculate average departures (trips) per station for each day of the week by taking the ratio of the total trips to the total stations used per day of the week
In [41]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. get number of weekday trips per station, user type and year in all data
            -- using ID
            SELECT start_station_id AS station_id,
                   started_at_year AS year,
                   DAYNAME(started_at) AS weekday,
                   REPLACE(user_type, ' Member', '') AS user_type,
                   COUNT(DISTINCT(trip_id)) AS trips
            FROM read_parquet({fpaths_proc_all})
            GROUP BY all
        ),
        -- 2. LEFT JOIN with table indicating if station is a top-performer
        t2 AS (
            SELECT *
            FROM t1
            LEFT JOIN (
                SELECT station_id,
                       is_top_perform_station
                FROM df_stations
            ) USING (station_id)
        ),
        -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
        -- with FALSE
        t3 AS (
            SELECT --fill missing values in is_top_perform_station with FALSE
                   * EXCLUDE(is_top_perform_station),
                   COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
            FROM t2
        ),
        -- 4. calculate average weekday trips per station, station type, user type and year
        t4 AS (
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'top-performer'
                       ELSE 'others' END
                   ) AS station_type,
                   user_type,
                   year,
                   weekday,
                   COUNT(DISTINCT(station_id)) AS num_stations_used,
                   SUM(trips) AS trips,
                   SUM(trips)/num_stations_used AS avg_trips_per_station
            FROM t3
            GROUP BY ALL
            ORDER BY ALL
        )
        SELECT *
        FROM t4
        """
df_by_day_of_week = run_sql_query(query).convert_dtypes()
pu.show_df(df_by_day_of_week.head())
column station_type user_type year weekday num_stations_used trips avg_trips_per_station
dtype string[python] string[python] Int32 string[python] Int64 Int64 Float64
nunique 1 1 1 5 1 5 5
missing 0 0 0 0 0 0 0
0 others Annual 2018 Friday 270 136756 506.503704
1 others Annual 2018 Monday 270 121718 450.807407
2 others Annual 2018 Saturday 270 84152 311.674074
3 others Annual 2018 Sunday 270 77576 287.318519
4 others Annual 2018 Thursday 270 147048 544.622222
CPU times: user 15 s, sys: 785 ms, total: 15.8 s
Wall time: 1.64 s

Show above as a chart

In [42]:
%%time
weekday_colors = [
    'lightgreen', '#39a055', 'black', '#208943', 'darkgreen', '#abdda5', '#c0e6ba'
]
chart = vzu.plot_bar_chart_array(
    df_by_day_of_week,
    xvar = 'weekday:N',
    yvar = "sum(avg_trips_per_station):Q",
    color_by_col = "weekday:N",
    color_labels = {'Annual': list(day_name), 'Casual': list(day_name)},
    color_values = {"Annual": weekday_colors, 'Casual': weekday_colors},
    legend=alt.Legend(
        titleFontSize=axis_label_fontsize,
        labelFontSize=axis_label_fontsize,
        labelColor='black',
        orient='bottom',
        direction='horizontal',
        titleOrient='left',
    ),
    title_text={
        'Annual': (
            'Weekday Ridership by Annual Members outside Top-Performers is '
            'Contracting'
        ),
        'Casual': 'Weekday Ridership by Casual Members is Growing Across Entire Network',
    },
    column_var = "year:N",
    column_spacing = 5,
    column_label_color = 'grey',
    column_sort=list(day_name),
    column_label_position = 'bottom',
    sub_title_text_color="#7f7f7f",
    sub_title_vertical_padding=5,
    axis_label_fontsize=axis_label_fontsize,
    concat='column',
    column_label_align='center',
    fig_size=dict(width=105, height=300),
)
chart['Annual'].save(os.path.join(figures_dir, '11_daily_weekday.png'))
chart['Casual'].save(os.path.join(figures_dir, '12_daily_weekend.png'))
display(chart['Annual'])
display(chart['Casual'])
CPU times: user 305 ms, sys: 0 ns, total: 305 ms
Wall time: 238 ms

Notes

  1. For each year, the bars are ordered chronologically by day of the week, with Monday on the far left and Sunday on the far right.
  2. The four darker colors highlight the days of the week (Tuesday, Wednesday, Thursday and Friday) with highest average ridership per station by top-performing Annual members during 2018 and 2019.

Observations

  1. Trend
    • Average ridership by day of the week per station by Annual members is showing a weak (top-performing) or more pronounced (others) drop since 2020. By comparison, the corresponding ridership was relatively unchanged from 2018 to 2019 at both types of stations.
    • Average ridership by day of the week per station by Casual members has grown at both types of stations with the strongest growth qualitatively seen in 2022.
    • Together, these trends are similar to the observations seen earlier by month
  2. Intra-week Seasonality
    • For Casual members
      • average weekday (Mon - Fri) ridership per station has been less than average weekend ridership since 2018
      • average ridership per station increases per day of the week from Monday to Saturday, before dropping on Sunday
    • For Annual members
      • average weekend (Mon - Fri) ridership per station has been lower than average weekday ridership since 2018
      • average ridersip per station reaches its maximum on Wednesday or Thursday since 2020
  3. Other
    • the emergence of dominance of Casual ridership during 2022 is also evident in daily average ridership per station: since 2022, average Casual member ridership per station is lowest on Monday but this is still larger than the highest average Annual member ridership per station on any day of the week.

Summary of Metrics Used

  1. average number of departures (trips) per station by day of the week, user type and station type for each year.

Show Relationship Between Temperature and Ridership By Station Type¶

Get the average daily trips per station for each year and month in all available data per user type and station type, combined with the maximum daily temperature in the city

  1. get daily departures (trips) per station, user type, year and month in all available data (2018 to 2023)
  2. combine with table indicating if table is a top-performer
  3. fill any missing values in the is_top_performer column after combining in step 2.
  4. get daily departures (trips) per station type and user type and calculate average number of daily departures (trips) per station for each day by taking the ratio of total daily trips to the total stations used per day
  5. combine with maximum daily temperature from weather data
In [43]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. get number of daily trips per station, user type, year and month in all data
            SELECT start_station_id AS station_id,
                   started_at_year AS year,
                   started_at_month AS month,
                   started_at_day AS day,
                   REPLACE(user_type, ' Member', '') AS user_type,
                   COUNT(DISTINCT(trip_id)) AS trips
            FROM read_parquet({fpaths_proc_all})
            GROUP BY all
        ),
        -- 2. LEFT JOIN with table indicating if station is a top-performer
        t2 AS (
            SELECT *
            FROM t1
            LEFT JOIN (
                SELECT station_id,
                       is_top_perform_station
                FROM df_stations
            ) USING (station_id)
        ),
        -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
        -- with FALSE
        t3 AS (
            SELECT --fill missing values in is_top_perform_station with FALSE
                   * EXCLUDE(is_top_perform_station),
                   COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
            FROM t2
        ),
        -- 4. calculate average daily trips per station, station type, user type, year and month
        t4 AS (
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'top-performer'
                       ELSE 'others' END
                   ) AS station_type,
                   user_type,
                   year,
                   month,
                   day,
                   make_date(year, month, day) AS date,
                   COUNT(DISTINCT(station_id)) AS num_stations_used,
                   SUM(trips) AS trips,
                   SUM(trips)/num_stations_used AS avg_trips_per_station
            FROM t3
            GROUP BY ALL
            ORDER BY ALL
        ),
        -- 5. LEFT JOIN with daily weather data
        t5 AS (
            SELECT *,
                   (
                       CASE
                           WHEN year = 2018 THEN 19
                           WHEN year = 2019 THEN 17
                           WHEN year = 2020 THEN 17
                           WHEN year = 2021 THEN 15
                       ELSE 2.5
                       END
                   ) AS x_turn_point,
                   0.75 AS y_turn_point,
                   -- append indicator symbol for presence (2018, 2019) or absence (2020, 2021, 2022)
                   -- of turning point in ridership based on temperature
                   '↑' AS mark_turn_point
            FROM t4
            LEFT JOIN (
                SELECT time as date,
                       tmin,
                       tmax
                FROM read_parquet({[fpath_weather]})
            ) USING (date)
        )
        SELECT *,
               make_date(2021, 10, 1) AS corporate_start
        FROM t5
        """
df_daily = run_sql_query(query).convert_dtypes()
pu.show_df(df_daily)
column station_type user_type year month day date num_stations_used trips avg_trips_per_station tmin tmax x_turn_point y_turn_point mark_turn_point corporate_start
dtype string[python] string[python] Int32 Int32 Int32 datetime64[us] Int64 Int64 Float64 Float64 Float64 Float64 Float64 string[python] datetime64[us]
nunique 2 2 6 12 31 1916 524 3895 6974 404 440 4 1 1 1
missing 0 0 0 0 0 0 0 0 0 4 4 0 0 0 0
0 top-performer Annual 2018 1 1 2018-01-01 49 109 2.22449 -21.3 -8.7 19.0 0.75 ↑ 2021-10-01
1 top-performer Annual 2018 1 2 2018-01-02 71 427 6.014085 -13.1 -7.8 19.0 0.75 ↑ 2021-10-01
2 top-performer Annual 2018 1 4 2018-01-04 74 530 7.162162 -20.5 -8.9 19.0 0.75 ↑ 2021-10-01
3 top-performer Annual 2018 1 5 2018-01-05 70 383 5.471429 -23.0 -15.0 19.0 0.75 ↑ 2021-10-01
4 top-performer Annual 2018 1 6 2018-01-06 62 194 3.129032 -23.5 -16.4 19.0 0.75 ↑ 2021-10-01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7652 others Annual 2021 10 19 2021-10-19 445 3245 7.292135 4.1 21.1 15.0 0.75 ↑ 2021-10-01
7653 others Annual 2022 6 29 2022-06-29 433 2489 5.748268 14.4 25.2 2.5 0.75 ↑ 2021-10-01
7654 others Annual 2022 10 18 2022-10-18 432 2742 6.347222 0.9 9.4 2.5 0.75 ↑ 2021-10-01
7655 others Annual 2023 2 3 2023-02-03 225 521 2.315556 -21.0 -10.0 2.5 0.75 ↑ 2021-10-01
7656 others Annual 2023 3 12 2023-03-12 257 527 2.050584 <NA> <NA> 2.5 0.75 ↑ 2021-10-01

7657 rows × 15 columns

CPU times: user 18 s, sys: 624 ms, total: 18.6 s
Wall time: 1.94 s

Notes

  1. HTML arrow symbols are used as indicators.

Show above as a chart

In [44]:
%%time
chart = vzu.plot_scatter_chart_grid(
    df_daily.query("year.isin(@years_temp_dependence)"),
    xvar = 'tmax:Q',
    yvar = 'avg_trips_per_station:Q',
    row_var = "year:O",
    color_by_col = 'user_type:N',
    col_var = "station_type:N",
    row_sort=years_temp_dependence,
    facet_label_color='grey',
    color_labels = ['Casual', 'Annual'],
    color_values = ['darkgreen', '#a1d99b'],
    legend=alt.Legend(
        title=None,
        titleFontSize=axis_label_fontsize,
        labelFontSize=axis_label_fontsize,
        labelColor='black',
        orient='bottom',
        direction='horizontal',
        titleOrient='left',
    ),
    y_scale='linear',
    ptitle=alt.TitleParams(
        (
            'Casual Ridership Shows Increased Resilience to Colder '
            'Temps. & Now Resembles Annual Ridership'
        ),
        anchor='start',
        dx=55,
        dy=-5,
        fontSize=axis_label_fontsize,
    ),
    marker_size=60,
    annotation_text_color = 'red',
    annotation_text_x_loc_var = 'x_turn_point:Q',
    annotation_text_y_loc_var = 'y_turn_point:Q',
    annotation_text_color_by_col = 'mark_turn_point:N',
    axis_label_fontsize = 14,
    fig_size=dict(width=375, height=300),
)
chart.save(
    os.path.join(
        figures_dir, '13_daily_max_temperature_vs_ridership.png'
    )
)
chart
CPU times: user 2.23 s, sys: 84.5 ms, total: 2.32 s
Wall time: 1.97 s
Out[44]:

Notes

  1. An inflection point is the temperature at which users decide whether or not to use the bike share service (i.e. whether or not to take a trip using bike share).
  2. The y-axis is shown on a log-scale in order to better visualize changes in ridership as temperature is increased.
  3. Each of the eight subplots is displaying 365 points for Annual members and 365 points for Casual members.

Observations

  1. 2018 and 2019
    • Annual is more dispersed, especially at lower temperatures.
    • Annual ridership is higher than casual, as was also seen in monthly, hourly and day-of-the-week charts.
    • Casual ridership shows an inflection point that was visually estimated and is shown with a red arrow for both station types.
    • A Casual ridership cluster (higher concentration of points) appears at bottom left (near the inflection point) and at the top right (centered at ~30C).
    • For Annual ridership, a similar cluster appears at top right only.
    • Top-performing stations show faster growth in Casual ridership above the inflection point. We can see this from the stronger slope in the dark green points before and after the inflection point.
  2. 2020, 2021, 2022
    • The Casual ridership scatter of points is less dispersed.
    • The Casual and Annual ridership scatter of points strongly overlap with each other.
    • Casual ridership shows an inflection point that is reducing in terms of temperature. Ridership is becoming more resilient to lower maximum daily temperatures.
    • For Annual and Casual ridership, a cluster appears at the top right only.
  3. Shape of Casual ridership plot for top-performers in 2022 is 180-deg rotated compared to 2019.
  4. From 2019 to 2022, the separation between the temperature dependence of Casual and Annual ridership has reduced until it was eventually almost absent. In 2022, the temperature dependences for both types of users show strong similarities to each other.

Summary of Metrics Used

  1. average number of daily departures (trips) per station, user type and station type for each year and month

Get the average daily trips per station for each year and month in all available data per station type overall (across both user types), combined with the maximum daily temperature in the city

  1. get daily departures (trips) per station, year and month in all available data (2018 to 2023)
  2. combine with table indicating if table is a top-performer
  3. fill any missing values in the is_top_performer column after combining in step 2.
  4. get daily departures (trips) per station type and calculate average number of daily departures (trips) per station for each day by taking the ratio of total daily trips to the total stations used per day
  5. combine with min and max daily temperature from weather data
In [45]:
%%time
query = f"""
        WITH t1 AS (
            -- 1. get number of daily trips per station, user type, year and month in all data
            SELECT start_station_id AS station_id,
                   started_at_year AS year,
                   started_at_month AS month,
                   started_at_day AS day,
                   COUNT(DISTINCT(trip_id)) AS trips
            FROM read_parquet({fpaths_proc_all})
            WHERE started_at_year <= 2022
            GROUP BY all
        ),
        -- 2. LEFT JOIN with table indicating if station is a top-performer
        t2 AS (
            SELECT *
            FROM t1
            LEFT JOIN (
                SELECT station_id,
                       is_top_perform_station
                FROM df_stations
            ) USING (station_id)
        ),
        -- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
        -- with FALSE
        t3 AS (
            SELECT --fill missing values in is_top_perform_station with FALSE
                   * EXCLUDE(is_top_perform_station),
                   COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
            FROM t2
        ),
        -- 4. calculate average daily trips per station, station type, year and month
        t4 AS (
            SELECT -- change the top-performing station column from a boolean to a string
                   (
                       CASE WHEN is_top_perform_station = True
                       THEN 'top-performer'
                       ELSE 'others' END
                   ) AS station_type,
                   year,
                   month,
                   day,
                   make_date(year, month, day) AS date,
                   COUNT(DISTINCT(station_id)) AS num_stations_used,
                   SUM(trips) AS trips,
                   SUM(trips)/num_stations_used AS avg_trips_per_station
            FROM t3
            GROUP BY ALL
            ORDER BY ALL
        ),
        -- 5. LEFT JOIN with minimum and maximum temperature in daily weather data
        t5 AS (
            SELECT *
            FROM t4
            LEFT JOIN (
                SELECT time as date,
                       MIN(tmin) AS tmin,
                       MAX(tmax) AS tmax
                FROM read_parquet({[fpath_weather]})
                GROUP BY ALL
            ) USING (date)
        )
        SELECT *,
               make_date(2021, 10, 1) AS corporate_start
        FROM t5
        ORDER BY station_type, date
        """
df_daily_overall = run_sql_query(query).convert_dtypes()
df_daily_overall = (
    df_daily_overall
    .assign(
        tmin=lambda df: df['tmin'].ffill(),
        tmax=lambda df: df['tmax'].ffill(),
    )
)
pu.show_df(df_daily_overall)
column station_type year month day date num_stations_used trips avg_trips_per_station tmin tmax corporate_start
dtype string[python] Int32 Int32 Int32 datetime64[us] Int64 Int64 Float64 Float64 Float64 datetime64[us]
nunique 2 5 12 31 1826 415 2951 3611 401 439 1
missing 0 0 0 0 0 0 0 0 0 0 0
0 others 2018 1 1 2018-01-01 83 128 1.542169 -21.3 -8.7 2021-10-01
1 others 2018 1 2 2018-01-02 143 515 3.601399 -13.1 -7.8 2021-10-01
2 others 2018 1 3 2018-01-03 139 622 4.47482 -13.5 -6.3 2021-10-01
3 others 2018 1 4 2018-01-04 141 619 4.390071 -20.5 -8.9 2021-10-01
4 others 2018 1 5 2018-01-05 123 393 3.195122 -23.0 -15.0 2021-10-01
... ... ... ... ... ... ... ... ... ... ... ...
3647 top-performer 2022 12 27 2022-12-27 92 1244 13.521739 -7.0 -2.0 2021-10-01
3648 top-performer 2022 12 28 2022-12-28 94 1827 19.43617 -2.0 5.0 2021-10-01
3649 top-performer 2022 12 29 2022-12-29 96 2352 24.5 2.0 9.2 2021-10-01
3650 top-performer 2022 12 30 2022-12-30 96 2120 22.083333 8.0 13.0 2021-10-01
3651 top-performer 2022 12 31 2022-12-31 95 1130 11.894737 4.0 10.0 2021-10-01

3652 rows × 11 columns

CPU times: user 11.3 s, sys: 547 ms, total: 11.9 s
Wall time: 1.23 s

Show above as a chart

In [46]:
%%time
chart = vzu.plot_multi_axis_line_chart_grid(
    df_daily_overall,
    xvar='yearmonthdate(date):T',
    line_colors={'top-performer': 'darkgreen', 'others': '#31a354'},
    y1var='avg_trips_per_station:Q',
    y2var_min='tmin:Q',
    y2var_max='tmax:Q',
    y_title_text='Daily Ridership per Station',
    y2_title_text='Temperature Range (C)',
    title_text='Total Daily Ridership is Maximized During Warmest Months',
    sub_title_text_color="#7f7f7f",
    sub_title_vertical_padding=5,
    shading_opacity=0.85,
    shading_color='lightgrey',
    axis_label_fontsize=14,
    fig_size=dict(width=700, height=360),
)
chart.save(
    os.path.join(
        figures_dir, '14_daily_max_temperature_and_ridership.png'
    )
)
chart
CPU times: user 413 ms, sys: 12.6 ms, total: 426 ms
Wall time: 346 ms
Out[46]:

Use R2 to calculate the yearly correlation between daily overall ridership and maximum daily temperature

In [47]:
df_daily_overall_r2 = (
    df_daily_overall
    .groupby(['station_type', 'year'], as_index=False)
    .apply(lambda df: linregress(df['trips'], df['tmax'])[2], include_groups=False)
    .rename(columns={None: 'y'})
)
df_daily_overall_r2
Out[47]:
station_type year y
0 others 2018 0.848727
1 others 2019 0.877885
2 others 2020 0.786010
3 others 2021 0.846501
4 others 2022 0.898041
5 top-performer 2018 0.874282
6 top-performer 2019 0.889999
7 top-performer 2020 0.793471
8 top-performer 2021 0.845074
9 top-performer 2022 0.896529

Show above as a heatmap

In [48]:
%%time
chart = vzu.plot_simple_heatmap(
    df_daily_overall_r2,
    xvar = 'year:O',
    yvar = 'station_type:O',
    color_by_col = 'y:Q',
    color_scale = 'linear',
    annot_color_threshold = 0.8,
    grid_linewidth = 0.75,
    ptitle_text = 'Daily Trips are Correlated to Max. Daily Temperature',
    axis_label_fontsize=14,
    fig_size = dict(width=510, height=115),
)
chart.save(
    os.path.join(
        figures_dir,
        '15_daily_max_temperature_ridership_correlation.png'
    )
)
chart
CPU times: user 71.2 ms, sys: 402 µs, total: 71.6 ms
Wall time: 56.3 ms
Out[48]:

Notes

  1. Darker shades of red indicate stronger correlation. Lighter shades of red indicate weaker correlation. Yellow indicates the weakest correlation.

Observations

  1. The line chart of average daily bike share ridership per station and station type temperature range shows the same pattern of a mid-year peak in daily ridership that was previously seen in monthly ridership. The heatmap shows that there is a strong correlation between daily ridership and maximum daily temperature. This can be inferred from the above point (1.) and other ridership patterns seen above. These are not surprising and are seen at both types of stations.

Summary of Metrics Used

  1. average number of daily departures (trips) per station and station type for each year and month

Recommendations¶

Based on insights into the temporal bike share patterns, it is recommended to focus on displaying ads on the faces of bike share stations during hours of the day, days of the week and months of the year during which average historical ridership per station was at its peak. Since the peaks are not spikes (or delta functions) and have some non-zero width, a window before and after the observed peak is required.

In terms of station attributes, the Regular station configuration and stations that accept credit card payments capture the majority (at least 80%) of top-performing stations. So, it not necessary to filter stations based on these two attributes.

In terms of overall station performance, it was found that stations which are top-performers overall and on both weekedays and weekends account for approximately 80% of top-performing stations. Both groupings of stations will be conisdered so a filter to eliminate one of thes egroupings is not required.

Below are the recommendations based on temporal insights

In [49]:
recommended_hour_filters = {
    "weekday_prime": (
        "day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')"
        "AND month IN ('May', 'June', 'July', 'August', 'September', 'October')"
        "AND hour IN (7, 8, 9, 15, 16, 17, 18)"
        "AND user_type IN ('Annual', 'Casual')"
    ),
    "weekend_prime": (
        "day_of_week IN ('Saturday', 'Sunday')"
        "AND month IN ('May', 'June', 'July', 'August', 'September', 'October')"
        "AND hour IN (13, 14, 15,16, 17, 18)"
        "AND user_type IN ('Casual')"
    ),
    "weekday_offseason": (
        "day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')"
        "AND month IN ('November', 'December')"
        "AND hour IN (15, 16, 17)"
        "AND user_type IN ('Annual', 'Casual')"
    ),
}
df_temporal_recommends = (
    pd.DataFrame.from_dict(recommended_hour_filters, orient='index')
    .transpose()
    .convert_dtypes()
)
with pd.option_context('display.max_colwidth', None):
    pu.show_df(df_temporal_recommends)
column weekday_prime weekend_prime weekday_offseason
dtype string[python] string[python] string[python]
nunique 1 1 1
missing 0 0 0
0 day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')AND month IN ('May', 'June', 'July', 'August', 'September', 'October')AND hour IN (7, 8, 9, 15, 16, 17, 18)AND user_type IN ('Annual', 'Casual') day_of_week IN ('Saturday', 'Sunday')AND month IN ('May', 'June', 'July', 'August', 'September', 'October')AND hour IN (13, 14, 15,16, 17, 18)AND user_type IN ('Casual') day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')AND month IN ('November', 'December')AND hour IN (15, 16, 17)AND user_type IN ('Annual', 'Casual')

Export temporal recommendations to disk

In [50]:
%%time
fname_prefix = "recommendations_temporal"
_ = df_temporal_recommends.pipe(
    flut.load,
    processed_data_dir,
    fname_prefix,
    my_timezone,
    True,
)
Exported 1 rows of recommendations_temporal data to /home/jovyan/data/processed/recommendations_temporal__20240322_093954.parquet.gzip
CPU times: user 17.5 ms, sys: 84 µs, total: 17.5 ms
Wall time: 17.2 ms

Discussion¶

Conclusions¶

  1. Since the corporate plan was introduced in late September of 2021, a (behavioural) change in the trend of bike share ridership has occurred. The factors that have contributed to this change in behaviour and that are likely to continue being present moving forward are
    • a switch to hybrid working hours (which effects both types of users)
    • increased Casual ridership
    • increased environmental awareness
    • improved cost flexibility (the corporate plan might partly account for this)
    • increased bicycle infrastructure (eg. bike lanes) introduced by the city of Toronto
  2. On a monthly basis, bike share ridership starts increasing in April or May each year, reaches a peak during the middle of each year (July) and then decreases with the strong dropoff after the Canadian Thanksgiving weekend (on October 9, 2023). The same is true for weekdays and weekends. On weekdays, average monthly ridership per station by Annual members has stayed relatively unchanged since 2018 at top-performing stations but is showing a decreasing trend at other stations. Since the corporate plan was offered, Casual ridership now accounts for the majority of average monthly bike share ridership per station on both weekdays and weekends.
  3. On an hourly basis, weekday Annual ridership is characterized by peaks during the morning and evening commuting hours (8AM and 4PM). The morning peak was much weaker for Casual users from 2018 to 2021. In 2022, this peak sharply increased for Casual users. Peak morning and evening ridership by Casual users now outperforms the same by Annual users. Weekend ridership is characterized by a single weaker and broader peak centered at 4PM. Here, Casual ridership more than doubled Annual hourly ridership in 2022 after being generally similar for the previous two years. These patterns are consistent between top-performing and other stations on weekends. These charts suggest commuter-driven ridership on weekdays and more leisurely usage on weekends. In summary, weekday hourly ridership patterns were previously driven by Annual users. However, in 2022, the rise of Casual ridership now means that Annual and Casual members show a similar hourly profile in terms of average ridership per station reflecting hybrid working hours. The rise of Casual ridership seen on a monthly basis is also seen on an hourly basis on weekends, where it is now the dominant contributor to weekend ridership.
  4. On an intra-week basis, average ridership per station by day of the week by Annual members is showing a weak (at top-performing stations) or more pronounced (at other stations) downward trend since 2020. By comparison, ridership by Casual members has only shown year-over-year growth since 2018. Daily Annual ridership is characterized by stronger demand during the middle of the week (Wednesday or Thursday) than on weekends since 2018. This is contrasted by Casual ridership which increases during the week Monday and reaches a maximum on Saturday, with stronger demand on weekends (Saturday and Sunday) than on weekdays (Monday to Friday) since 2018. In 2022, the lowest average Casual member daily ridership per station is larger than the highest average daily Annual member ridership per station on all days of the week. These patterns are also consistent between top-performing and other stations. This reinforces the observation that Casual ridership is increasing across the network, which was seen prevoius by month and hour.
  5. There is evidence of increased resilience of Casual ridership to colder daily temperatures. Casual ridership grows at a sharper rate above a cutoff at both types of stations and this cutoff has been qualitatively shifting towards colder maximum daily temperatures. Also, as of 2022, the maximum daily temperature dependence profiles for Annual and Casual members' daily ridership, which were divergent during 2019, now overlap with each other. Finally, it is not surprising that daily ridership is observed to be correlated to maximum daily tempererature.

Summary of Assumptions¶

  1. Same as in data retrieval and processing step.

Version Information¶

In [51]:
packages = [
    'scipy',
    'pandas',
    'pyarrow',
    'duckdb',
    'altair',
    'vl-convert',
]
print(
    watermark(
        updated=True,
        current_date=True,
        current_time=True,
        timezone=True,
        custom_time="%Y-%m-%d %H:%M:%S %Z",
        python=True,
        machine=True,
        packages=','.join(packages),
    )
)
Last updated: 2024-03-22 13:39:54 UTC

Python implementation: CPython
Python version       : 3.11.8
IPython version      : 8.22.2

scipy     : 1.12.0
pandas    : 2.2.1
pyarrow   : 15.0.1
duckdb    : 0.10.0
altair    : 5.2.0
vl-convert: not installed

Compiler    : GCC 12.3.0
OS          : Linux
Release     : 6.6.10-76060610-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit